This dissertation presents a method for converting paper-based documentation to
computerized  form.  The  main  process  for  such  conversion  is  considered  as  computer 
understanding  of  the  document  image  which  is  a visual  representation  of  a two- 
dimensional  field  consisting  of  blocks  of  text,  graphics,  and  picture  images.  In 
designing  this  document  analysis  system,  automatic  block  segmentation  and 
classification  of  a digitized  document  image  are  necessary  stages.  For  the  automatic 
block  segmentation,  we  have  developed  a robust  approach  which  connects  black 
pixels  within  the  predetermined  distance  to  separate  the  blocks.  The  segmentation 
procedure  is  performed  as  a top-down  approach  to  reduce  the  processing  time. 

For  the  development  of  a block  classification  algorithm  insensitive  to  skew, 
the  block  classification  rule  based  on  black  pixels  is  considered  as  a way  to  solve  this 
problem. This method uses a step-by-step classification approach to avoid an exhaustive
classification procedure. The ratio between the black-white transition count and the
count of black pixels of each block is used as one of the measurements. This ratio is
almost invariant to skew and is consistently high for text blocks. A further
pixel-based operator then classifies each block in detail.

For  a system  that  not  only  recognizes  the  text  block  but  also  understands  a 
nontext  region,  we  have  utilized  and  integrated  advanced  technologies  developed  in 
various  methodologies.  In  the  stage  for  understanding  nontext,  we  have  distinguished 
some  of  the  symbols  from  the  other  picture  images.  We  have  divided  the  symbols 
into two different categories. Two image processing techniques, thinning and
boundary finding, are applied to line-type and blob-type symbols, respectively,
in order to extract valuable features. We have used a geometrical feature for
line-type symbols and applied a weighted graph matching method for identification.

CHAPTER 1
INTRODUCTION 


A document analysis system which converts paper-based documentation to
computerized  form  involves  a two-dimensional  information  processing  task.  Such  a 
system  must  recognize  characters  of  a text  block  and  identify  nontext  regions  such  as 
line  drawings  and  pictures.  A human  understands  a page  image  by  applying  sufficient 
knowledge  to  it.  Documents  with  nontext  are  becoming  more  common.  The  trend  is 
to  have  many  more  documents  composed  of  text,  graphics,  and  pictures.  In  most 
documents  some  pages  have  pictures  because  editors  have  learned  that  all-type  pages 
generally  have  little  eye  appeal  and  pictures  help  the  reader  to  understand.  When  no 
photos  or  other  illustrations  pertinent  to  the  article’s  subject  are  available,  many 
editors  will  try  to  use  a picture  that  is  germane  to  the  subject  and  helps  attract 
attention  to  the  article  even  if  it  adds  no  information  of  value.  It  seems  impossible 
to  give  the  machine  the  knowledge  which  humans  possess  about  a page  because  it 
is  not  known  how  the  human  eye  recognizes  the  object. 

Fortunately,  a page  image  has  a distinctive  geometric  layout  in  printed  form. 
In  the  composition  of  a page  the  first  consideration  is  a well-organized  layout  with 
proper  columns  and  margin  selections.  Graphics  and  pictures  are  well  separated  from 
the  text  block  and  other  nontext  regions.  Thus,  such  an  arrangement  of  content  on 
the page provides a stepping stone to designing a document analysis system to cope
with the knowledge explosion that threatens to bury us in information debris. Such a
system not only copes with the explosion of knowledge but can also help
people with visual handicaps. Thus far, the aids available for blind and
physically handicapped individuals are the tape-recorded book, the magnifier, and braille.
Tape  recorded  books  are  produced  by  human  readers  working  with  a reading  service. 
These  reading  services  mainly  rely  upon  volunteers.  Developing  a system  for 
document  content  analysis  is  particularly  important  for  the  increasing  number  of 
elderly  people,  many  of  whom  are  visually  handicapped. 

1.1  Statement  of  the  Problem 

A computerized  form  of  paper-based  documentation  provides  some 
advantages  such  as  efficient  document  update  and  revision;  such  conversion  of  paper- 
based  documentation  to  computerized  form,  however,  requires  several  steps  including 
preprocessing  steps.  The  main  process  for  conversion  is  considered  as  computer 
understanding  of  document  images.  The  document  image  is  a visual  representation 
of  a two-dimensional  field  consisting  of  blocks  with  text  only  and  blocks  including 
nontext. Text understanding has already been developed as an extension of
character recognition. Complete understanding of paper-based documentation,
however, confronts the difficulty of understanding nontext regions. Many efforts to interpret
engineering drawings and diagrams by machine can be found in the literature
[Hua86b, Tou87b, Kas90].

The preprocessing stage for the document analysis system requires the
conversion  of  a paper-based  document  to  a digital  bit-map  representation  after 
optical  scanning,  followed  by  the  automatic  segmentation  of  the  page  image  which 
separates  each  block  by  spaces  between  them.  Two  diametrically  opposite 
philosophies  called  top-down  and  bottom-up  have  been  proposed  for  the  automatic 
segmentation  task.  Certain  global  operations  are  performed  on  an  entire  page  image 
in  the  top-down  approach,  while  character  components  are  individually  detected  and 
then  merged  together  into  progressively  larger  blocks  using  component  properties 
and  interrelationships  in  the  bottom-up  approach.  Several  approaches  such  as 
projection  profiles  or  run-length  smoothing  have  been  proposed;  these  enable 
segmenting  the  image  into  blocks,  each  of  which  can  be  subsequently  classified  using 
pattern classification techniques. Both projection profile and run-length
smoothing-based techniques are fast; however, they are too rigid and fail if the documents or
their constituent textual blocks are skewed. In other words, if any pixel of the next
text line is higher than or at the same height as the lowest pixel of the text line being
scanned, the document image will not be segmented as desired for subsequent
block classification.

Some  algorithms  have  been  reported  in  the  literature  for  text  string 
separation,  an  early  stage  in  any  document  analysis  system.  However,  many  of  these 
algorithms  are  very  restrictive  in  the  type  of  documents  they  can  process;  others  are 
robust  but  they  are  too  computationally  intensive  to  be  used  without  special  purpose 
hardware [Fle88].

1.2 Objectives

The  objectives  of  this  research  are  first,  to  develop  a robust  and  fast  algorithm 
for  segmentation  of  the  page  and  a precise  classification  rule  for  the  block,  and 
second,  to  attempt  an  understanding  of  some  nontext  regions  such  as  pictures  or  line 
drawings.  In  order  to  achieve  these  goals,  we  divided  the  tasks  into  the  following 
subtasks: 

(1) Development of a fast and robust algorithm to segment blocks.

Before a page can be recognized and analyzed, it should be
segmented into blocks which are separated by spaces or lines. The algorithm should
be fast and robust to skew.

(2)  Development  of  an  approach  classifying  each  segmented  block. 

In  order  to  recognize  a page  with  text  and  nontext,  each  block  should  be 
classified  according  to  what  it  contains.  Nontext  should  be  classified  in  detail  to 
handle  the  document  management. 

(3)  Development  of  a systematic  and  efficient  matching  process. 

The  processing  of  visual  information  is  typically  a multi-layered  task.  The 
human  brain  mechanisms  of  vision  clearly  indicate  that  there  is  a layered  approach 
to  the  processing  of  visual  information  both  physiologically  and  psychologically. 

(4)  Design  of  a system  for  understanding  some  graphics  and  symbols. 

A page  contains  some  graphics  images,  especially  line  drawings,  to  help 
readers  understand.  Since  a typical  drawing  contains  both  text  strings  and  graphics, 
recognition of text can be separated from the understanding of the graphics image.

1.3 Approaches

The  work  in  this  dissertation  uses  the  document  analysis  system  shown  in 
Figure  1-1.  The  primary  processing  step,  which  is  the  bulk  of  this  dissertation,  is 
illustrated  in  Figure  1-2.  The  design  of  a machine  vision  system  concerned  with 
document  analysis  involves  several  major  problem  areas.  These  are  (1)  the 
preprocessing  problem  known  as  low-level  processing;  (2)  the  image  segmentation 
and  classification  problem  categorized  as  a pre-step  for  intermediate-level  processing; 
(3)  the  feature  extraction  and  scene  analysis  problem  known  as  intermediate-level 
processing.  This  dissertation  presents  several  unique  methodologies  for  some  of  the 
problem  areas:  (1)  a block  segmentation  algorithm  which  is  robust  to  page  skewing, 
(2) a block classification algorithm invariant to changes in block shape, and (3) a
graph-theoretical approach to the recognition of nontext areas such as symbols
and/or line drawings.

An elegant algorithm is described for block segmentation. This process rapidly
connects each component, dividing the document into blocks separated by spaces.
In connecting components, the proposed algorithm applies an operation to each
black pixel so as to generate a linearly expanded contour of each component. The
performance of the algorithm is measured by the time required to execute it
on problems of size N. This algorithm has a time complexity of $N^2$, like
most previous top-down approaches for block segmentation; however, it is invariant
to skew and is faster than most previous approaches for most inputs, even though
they have the same time complexity of $N^2$. As the very next stage after block
segmentation, each block should be classified according to what it contains, and the block
classification rules must not be restricted to error-free document image
data. In order to develop an algorithm insensitive to skew, block classification rules
based on pixel-level data are considered. Unlike
previous classification rules, this algorithm extracts its features from pixel-level
data.

Figure 1-1. System Overview for Document Analysis.

Figure 1-2. Primary Components of a Document Analysis System.

Several  measurements  are  considered  for  developing  a block  classification 
rule.  In  order  to  apply  measurements  for  classifying  the  blocks,  each  block  should  be 
analyzed  and  considered  based  on  structural  hierarchy.  For  example,  a text  block  is 
composed  of  text  lines,  and  each  text  line  consists  of  a variety  of  characters.  The 
basic  components  of  characters  are  line  segments  of  uniform  width.  The  proposed 
pixel  operations  can  distinguish  a line  segment  from  a blob  segment  whose  skeleton 
does  not  represent  its  own  properties.  Separation  of  text  from  complex  fine  line 
drawing  is  made  by  the  removal  of  the  pure  line  segment  which  is  the  primary 
component of the line drawing. Features such as the probability of occurrence of black
pixels in a block, the number of black pixels after applying consecutive operations,
the black-white transition count, and the total number of black neighbors of each black pixel
in the original image are used for a new classification rule.
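
The black-white transition feature can be illustrated with a minimal sketch. It assumes that a block is stored as a list of rows of 0/1 values (1 = black) and that transitions are counted along rows only. The function name `transition_ratio` and the row-wise scan are illustrative assumptions, not the implementation used in this work.

```python
def transition_ratio(block):
    """Ratio of black-white transitions to black pixels in a binary block.

    block is a list of rows; each row is a list of 0 (white) or 1 (black).
    Assumption: a transition is any change of value between horizontally
    adjacent pixels in the same row.
    """
    black = sum(sum(row) for row in block)
    transitions = 0
    for row in block:
        for left, right in zip(row, row[1:]):
            if left != right:
                transitions += 1
    # An all-white block has no black pixels; report 0.0 instead of dividing by zero.
    return transitions / black if black else 0.0
```

Under these assumptions, a block made of thin strokes, such as the single row `[1, 0, 1, 0, 1]`, yields a high ratio, while a solid black region yields a ratio of zero. This matches the observation that the ratio is high for text blocks.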

In  the  understanding  stage,  some  symbols  stored  in  the  database  will  be 
distinguished  from  the  picture  image.  These  symbols  are  classified  into  two  types 
depending upon whether the segment is of line or blob shape. Two image
processing techniques, thinning and boundary finding, will be applied to line-type
and blob-type symbols, respectively, in order to extract their features. Geometrical and
topological features are utilized for line-type symbols. On the other hand, the
Fourier descriptor of the boundary, which is treated as a closed curve, can be used
to recognize blob-type symbols.

1.4  Preview  of  Remaining  Chapters 

Chapter  2 presents  a brief  survey  of  previous  work  on  the  segmentation  of 
images  and  the  classification  of  block  segments.  We  discuss  both  (1)  the  direct 
method of segmenting the page image by both top-down and bottom-up approaches,
and  (2)  the  transformational  method  which  converts  a coordinate  space  to  another 
coordinate  space  so  as  to  eliminate  the  skew  problem. 

In  chapter  3,  we  describe  a robust  algorithm  for  block  segmentation  and  an 
unconstrained block classification rule. Since some documents consist of complex
data  such  as  text,  graphics,  and  pictures,  block  segmentation  and  block  classification 
are  required  for  designing  an  automatic  document  analysis  system.  We  also  describe 
page-structure  analysis  based  on  standard  document  architecture  and  present 
experimental  results. 

In  chapter  4,  we  describe  the  page  recognition  system  which  not  only 
recognizes  the  text  block  but  also  understands  some  of  the  nontext  regions.  The  text 
recognition  only  deals  with  the  off-line  case  in  which  the  characters  are  written 
completely  on  a sheet  or  on  any  other  materials,  because  only  page  images  are 
treated in this thesis. The understanding of nontext has mainly been done on line
drawings such as logic circuit diagrams, mechanical engineering drawings, and
trademarks. Prior to the understanding stage, each line drawing is classified into
a more detailed category based on evidence extracted from it.

We  close  with  chapter  5.  After  documents  have  been  created,  some  sort  of 
file-management  facility  is  required  for  a user  to  collect  the  documents  into  piles  and 
to  put  these  piles  away.  When  a document  is  filed  away  in  a document-management 
system,  it  is  necessary  to  be  able  to  retrieve  it  at  some  later  time.  The  design  of  a 
complete  document-filing  and  retrieval  system  is  very  complex,  and  it  is  beyond  the 
scope  of  this  dissertation.  Chapter  5 describes  some  aspects  for  a document-filing  and 
retrieval  system  that  would  be  appropriate  for  an  extension  of  the  work  in  this 
dissertation. 


CHAPTER  2 

BACKGROUND  AND  REVIEW  OF  PREVIOUS  WORK 


In  this  chapter  we  review  previous  work  on  the  segmentation  of  page  images 
and  the  classification  of  the  segmented  blocks.  The  automatic  segmentation  of  the 
page image is a significant process for a document analysis system, first demonstrated
by Wahl, Wong and Casey [Wah82]. They presented a heuristic approach to the
problem that operates on a binary image of the page. As a preprocessing stage, they
convert the gray-scale image into a binary-valued image by comparing the
gray-level values with a threshold value.

This  chapter  describes  a few  preprocessing  techniques  and  the  segmentation 
and  classification  processes  which  are  at  the  heart  of  a document  analysis  system. 
Prior  to  reviewing  previous  work  on  segmentation  of  the  page  image  and 
classification  of  the  segmented  blocks,  some  preprocessing  techniques  will  be 
described  briefly.  Before  the  document  image  can  be  processed,  a format  conversion 
is  required  because  the  image  obtained  from  a scanner  is  in  raster  format.  This  raster 
format  of  a document  image  is  converted  to  a raw  format  in  order  to  process  the 
image  data. 


2.1 Preprocessing

Since the early 1960s, much work has been done in image processing,
especially on reducing the amount of computation and on reducing
errors efficiently. Even normal text documents present several difficult problems for
further processing because of variations in type production. Some characters in the
document image are often smeared or smudged, or printed with either
very light strokes, which are difficult to detect, or very heavy strokes, which tend to
broaden and run together when imaged by a gray-level scanner. Furthermore, the
amount of data in a scanned document is enormous. To address these problems, we
discuss some techniques for (1) binary representation by image thresholding and (2)
thinning and boundarization for object description.

2.1.1 Binary Representation by Image Thresholding

The  creation  of  a binary  representation  from  an  analog  image  requires  that 
we  determine  whether  a point  is  converted  into  a binary  one  or  a binary  zero 
depending  on  the  grey-level  measured  by  a scanner.  Thresholding  is  an  obvious  tool 
for  creating  the  binary  representations  from  grey  level  images.  By  judiciously 
choosing  a grey  level  threshold  between  the  dominant  values  of  the  object  and  the 
background,  the  original  grey  level  image  can  be  transformed  into  a binary  form. 
Although the method appears simple, it is not easy to find the threshold
value from the poor gray-level image data normally returned by a scanner. The
scanning hardware, due to technology and cost limitations, has nonuniform
illumination over the scan field, sensitivity and dark current variations from element
to element in the scanning array, and distorted resolution from the lens.

Kittler and Illingworth [Kit86] proposed minimum error thresholding, which
is  applicable  in  multithreshold  selection.  Minimum  error  threshold  uses  a histogram 
which  summarizes  the  distribution  of  the  grey  levels  in  the  image  and  gives  the 
frequency  of  occurrence  of  each  grey  level  in  the  image.  The  histogram  can  be 
viewed  as  an  estimate  of  the  probability  density  function  p(g)  of  the  mixture 
population  comprising  grey  levels  of  object  and  background  pixels.  In  the  following, 
each of the two components p(g | i) of the mixture is normally distributed with mean
$\mu_i$, standard deviation $\sigma_i$, and a priori probability $P_i$, i.e.

$$p(g) = \sum_{i=1}^{2} P_i \, p(g \mid i) \tag{2.1}$$

where

$$p(g \mid i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(g - \mu_i)^2}{2\sigma_i^2} \right) \tag{2.2}$$

For given p(g | i) and $P_i$ there exists a gray level $\tau$ for which the gray levels g satisfy

$$P_1 \, p(g \mid 1) > P_2 \, p(g \mid 2), \quad g \le \tau \tag{2.3a}$$

$$P_1 \, p(g \mid 1) < P_2 \, p(g \mid 2), \quad g > \tau \tag{2.3b}$$

where $\tau$ is the Bayes minimum error threshold at which the image should be
binarized. The problem of minimum error threshold selection is to determine the
optimum threshold value $\tau$ from an estimate of the probability density function,
namely the histogram. The minimum error threshold can be obtained by solving
the quadratic equation that expresses the condition for the
existence of the thresholding gray level. However, the parameters of the mixture
density function associated with an image to be thresholded will not usually be
known. Fitting techniques estimate these parameters from the gray-level
histogram. In the fitting technique, the average
performance figure for the whole image can be characterized by a criterion
function. One of the techniques for finding the optimum threshold $\tau$ can be
summarized as follows. Suppose that the gray-level data is thresholded at some
arbitrary level T and each of the two resulting pixel populations is characterized by
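a normal density $h(g \mid i, T)$ with parameters $\mu_i(T)$, $\sigma_i(T)$ and a priori probability $P_i(T)$,
which are given by

$$P_i(T) = \sum_{g=a}^{b} h(g) \tag{2.4}$$

$$\mu_i(T) = \left[ \sum_{g=a}^{b} h(g)\, g \right] \Big/ P_i(T) \tag{2.5}$$

and

$$\sigma_i^2(T) = \left[ \sum_{g=a}^{b} \bigl( g - \mu_i(T) \bigr)^2 h(g) \right] \Big/ P_i(T) \tag{2.6}$$

where

$$a = \begin{cases} 0, & i = 1 \\ T + 1, & i = 2 \end{cases} \tag{2.7}$$

and

$$b = \begin{cases} T, & i = 1 \\ n, & i = 2 \end{cases} \tag{2.8}$$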


where n is the gray level of a white pixel. Now using the model $h(g \mid i, T)$, i = 1, 2,
the conditional probability e(g, T) of gray level g being replaced in the image by the
correct binary value is given by

$$e(g, T) = \frac{h(g \mid i, T)\, P_i(T)}{h(g)}, \qquad i = \begin{cases} 1, & g \le T \\ 2, & g > T \end{cases} \tag{2.9}$$

An index of correct classification performance, $\varepsilon(g, T)$, is obtained by taking the
logarithm of the numerator in (2.9), multiplying the result by -2, and dropping the
constant term:

$$\varepsilon(g, T) = \left[ \frac{g - \mu_i(T)}{\sigma_i(T)} \right]^2 + 2 \log \sigma_i(T) - 2 \log P_i(T), \qquad i = \begin{cases} 1, & g \le T \\ 2, & g > T \end{cases} \tag{2.10}$$

The average performance figure for the whole image can then be characterized by
the criterion function

$$J(T) = \sum_{g} h(g)\, \varepsilon(g, T) \tag{2.11}$$

This  criterion  function  reflects  indirectly  the  amount  of  overlap  between  the  Gaussian 
models  of  the  object  and  background  populations.  The  better  the  fit  between  the 
data  and  the  models,  the  smaller  the  overlap  between  the  density  functions  and 
therefore  the  smaller  the  classification  error.  The  problem  of  minimum  error 
threshold  selection  can  then  be  formulated  as  one  of  minimizing  criterion  J(T),  i.e. 

$$J(\tau) = \min_{T} J(T) \tag{2.12}$$
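
The threshold search described above can be sketched as an exhaustive evaluation of J(T) over all candidate thresholds. This is a minimal illustration under stated assumptions, not the implementation of [Kit86]: the histogram is a plain list of pixel counts indexed by gray level, h(g) is taken as the count divided by the total number of pixels, and thresholds that leave a class empty or with zero variance are skipped. The function name `minimum_error_threshold` is hypothetical.

```python
import math

def minimum_error_threshold(hist):
    """Return the threshold T that minimizes the criterion J(T).

    hist[g] is the number of pixels with gray level g, for g = 0..n.
    Class 1 covers levels 0..T and class 2 covers levels T+1..n.
    Returns None if no threshold gives two non-degenerate classes.
    """
    total = sum(hist)
    n = len(hist) - 1
    best_t, best_j = None, math.inf
    for t in range(n):
        j = 0.0
        valid = True
        for a, b in ((0, t), (t + 1, n)):
            levels = range(a, b + 1)
            count = sum(hist[g] for g in levels)
            if count == 0:            # empty class: model undefined
                valid = False
                break
            p = count / total                                            # P_i(T)
            mu = sum(hist[g] * g for g in levels) / count                # mu_i(T)
            var = sum(hist[g] * (g - mu) ** 2 for g in levels) / count   # sigma_i(T)^2
            if var == 0.0:            # all mass on one level: log undefined
                valid = False
                break
            for g in levels:
                # Index of correct classification; note 2 log(sigma) = log(sigma^2).
                eps = (g - mu) ** 2 / var + math.log(var) - 2.0 * math.log(p)
                j += hist[g] / total * eps                               # J(T)
        if valid and j < best_j:
            best_t, best_j = t, j
    return best_t
```

For a clearly bimodal histogram such as `[0, 2, 6, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 6, 2, 0]`, the returned threshold falls in the empty valley between the two modes. Ties between equally good thresholds are resolved in favor of the smallest T, which is a choice of this sketch.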

2.1.2 Thinning and Boundarization for Object Description

Two  common  ways  to  describe  an  object  are  by  the  use  of  boundaries  and  of 
skeletons.  An  object  in  two-dimensional  space  is  completely  determined  if  we  know 
its  borders,  and  provided  that  we  also  know  which  side  of  each  border  is  inside  the 
object  and  which  is  outside.  If  it  has  holes,  an  object  may  have  more  than  one 
border. A different way of describing an object makes use of representation by its
"skeleton". Finding the skeleton of an object is usually called thinning of image data.
The thinning algorithm is usually an iterative edge-point erosion technique. The
purpose of thinning is to simplify the boundary image by reducing it to its skeleton
without destroying its geometrical shape and connectivity. In other words, the
thinned pattern, the so-called skeleton, must preserve the connectedness and shape of
the original pattern, except for extremely thick objects. It should be noted that the
skeleton of a pattern may not be unique. Over the years, many algorithms
have been proposed for thinning [Nac84, Pav80]. If the region is composed of thin
components,  it  can  be  described  well  by  its  skeleton.  Skeletons  derived  by  the 
thinning  algorithm  keep  connectivity  of  regions. 

Thinning usually consists of iteratively deleting border points such that
deletion of these points does not remove end-points, does not break the
connectivity of the pattern, and does not cause excessive erosion. To prevent excessive
erosion, end-points are never removed. Here, an end-point is defined as a dark
point with at most one dark eight-neighbor. The eight-neighbors of a point p are
defined to be the eight points adjacent to p. Points p0, p2, p4, and p6 are referred to
as the four-neighbors of p in Figure 2-1. (In this representation it is assumed that the
object is represented by a regular grid of measurements, in equal steps in both the
row and column directions.)

p3  p2  p1
p4  p   p0
p5  p6  p7

Figure 2-1. A point p and its neighbors.

Most thinning algorithms operate similarly: a dark point is deleted
from the pattern if it satisfies a certain condition, such as being an edge-point, not being an
end-point, or being a predefined point. These algorithms are mostly applied to smooth synthesized
patterns that have no irregularities and no noise. However, the thinning process is
very sensitive to noise on the binary boundary. A small disturbance of the boundary
creates small spurious strokes as well as a disturbance in the main skeleton. In the
boundary image, there is positive noise, which is either isolated salt-and-pepper noise
or boundary-annexed noise. Therefore, we have to remove the small strokes from the
main skeleton through a refining process after the main stage of the thinning
process. The thinning algorithm is as follows:


1.  Set the flag remain to true.
2.  While remain is true do steps 3-14.
    Begin.
3.      If p is 1 and not a pixel of a double line then do
4.          For j = 0 to 7 do steps 5-6.
5.              count the number of changes for 8-neighbors ( c(p) ).
6.              count the number of dark pixels for 8-neighbors ( N8(p) ).
7.          For j = 0, 2, 4, 6 do step 8.
8.              count the number of dark pixels for 4-neighbors ( N4(p) ).
9.          If c(p) is 1 and 3 < N8(p) < 7 and N4(p) < 4 then do
10.             set p equal to 0.
11.     If p is the last pixel then do steps 12-14.
12.         For all pixels p of image data do
            Begin
13.             If A = B then set remain to true,
                else set remain to false
            End.
14.         Set A to B.
15. End.
16. End of Algorithm.
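
The per-pixel quantities used in the algorithm can be sketched as follows. This is an illustrative helper, not the implementation used in this work. It assumes the image is a list of rows of 0/1 values (1 = dark), that pixels outside the image count as white, and that the "number of changes" c(p) means the number of white-to-dark transitions met when visiting p0, p1, ..., p7 cyclically. The names `OFFSETS`, `neighbors`, `measures` and `is_end_point` are hypothetical.

```python
# (row offset, column offset) of p0..p7, following the layout of Figure 2-1:
#   p3 p2 p1
#   p4 p  p0
#   p5 p6 p7
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def neighbors(image, r, c):
    """Return [p0, ..., p7] for pixel (r, c); outside pixels count as white (0)."""
    rows, cols = len(image), len(image[0])
    values = []
    for dr, dc in OFFSETS:
        rr, cc = r + dr, c + dc
        inside = 0 <= rr < rows and 0 <= cc < cols
        values.append(image[rr][cc] if inside else 0)
    return values

def measures(image, r, c):
    """Return (c(p), N8(p), N4(p)) for the pixel at (r, c)."""
    p = neighbors(image, r, c)
    # Assumed meaning of c(p): white-to-dark transitions around the cyclic sequence.
    changes = sum(1 for j in range(8) if p[j] == 0 and p[(j + 1) % 8] == 1)
    n8 = sum(p)                       # dark eight-neighbors
    n4 = p[0] + p[2] + p[4] + p[6]    # dark four-neighbors
    return changes, n8, n4

def is_end_point(image, r, c):
    """An end-point is a dark point with at most one dark eight-neighbor."""
    return image[r][c] == 1 and sum(neighbors(image, r, c)) <= 1
```

For a one-pixel-wide horizontal stroke, the two extreme pixels are end-points and are therefore protected from deletion, while an interior pixel has two dark eight-neighbors and two transitions.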


2.2  Review  of  Previous  Work 

Since  the  late  1970s,  several  approaches  for  text  block  separation  from  mixed 
text/graphics  images  have  been  proposed  as  an  intermediate  process  for  document 
analysis  systems.  These  approaches  are  categorized  into  two  methods,  direct  and 
transformational.  The  page  is  usually  split  into  blocks  in  order  for  the  reader  to  be 
able  to  read  with  ease.  The  white  space  between  the  blocks  is  usually  wider  than  the 
space  between  text-lines.  Direct  methods  separate  the  document  image  into  several 
blocks  applying  certain  rules  to  the  image  data  directly.  The  page  image  is  segmented 
into  blocks  by  layout  structure  and  each  block  is  classified.  The  transformational 
method  converts  the  image  data  from  one  coordinate  space  to  another  coordinate 
space, then applies a rule to it. Such a conversion from one coordinate space to
another is performed to obtain a robust approach for separating
text blocks from mixed text and graphics images. The
Hough transform is applied to separate text from documents containing both text and
graphics. Here, we discuss the most representative frameworks for (1) the
block segmentation rule for page images, (2) the block classification rule for the
segmented blocks, and (3) recognition of textual blocks using the Hough transform.

2.2.1  Block  Segmentation  Rule  for  Page  Images 

Wong  et  al.  [Won82]  and  Wahl  et  al.  [Wah82]  proposed  the  Document 
Analysis  System  which  assists  a user  in  encoding  a printed  document  for  computer 
processing.  The  proposed  system  consists  of  a block  segmentation  stage  and  a block 
classification  procedure  mainly  to  analyze  the  document  image,  which  is  the  visual 
representation  of  a page.  First,  a segmentation  procedure  subdivides  the  area  of  a 
document  into  blocks,  each  of  which  should  contain  only  one  type  of  data.  Second, 
some  basic  features  of  these  blocks  are  calculated  in  order  to  classify  them  into  a 
specific type. At an early stage of the proposed system, they used a run-length
smoothing algorithm (RLSA), which operates on every dark point in the document
image, as the block segmentation rule. The RLSA had been used earlier to detect long
vertical and horizontal white lines. This algorithm was extended to obtain a
bit-map of white and black areas representing blocks containing the various types of
data. The RLSA operates on two black pixels which have at most a predetermined
number of contiguous white pixels between them on the same column or the same
row: the white pixels between them are changed to black.
The RLSA is first applied row-by-row and then column-by-column, yielding two
distinct bit maps. The two results are then combined by applying a logical AND to
each pixel location. The RLSA can detect small blocks such that each block includes
just a text line in the text region. The RLSA is fast; however, it fails if the text lines
are skewed: it does not extract any text lines even though text lines exist in the page.
Figure 2-2 shows the segmented blocks of a document skewed by 1.2°. Although the
document is skewed by less than two degrees, the RLSA does not generate blocks
of text lines.

Figure 2-2. The failure of RLSA when the document is skewed.

Nagy et al. [Nag86] described a top-down segmentation strategy
called RXYC (recursive X-Y cuts). This approach is also known as projection profile
cuts. Printed pages are conventionally made up of rectangular blocks, so a page can
be recursively cut into rectangular blocks. Thus the document is represented in the
form of a tree of nested rectangular blocks. At each step of the recursive process, the
projection profile is computed along both the horizontal and the vertical direction;
each profile value is simply the sum of all the pixel values along one line in that
direction. Division along the two directions is then accomplished by making cuts
corresponding to deep valleys in the projection profile whose width is larger than a
predetermined threshold. The RXYC identifies large blocks, as illustrated in
Figure 2-3; however, like the RLSA, it fails if the text lines are skewed.

Figure 2-3. An example of the subdivided page by RXYC.
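
The recursion described above can be sketched as follows. This is a simplified illustration, not the procedure of [Nag86]: a "deep valley" is assumed here to be a run of completely empty profile entries at least `min_gap` wide, horizontal cuts are tried before vertical ones, and the names `split_spans` and `xy_cut` are hypothetical.

```python
def split_spans(profile, min_gap):
    """Split a projection profile into content spans separated by empty
    valleys (runs of zeros) at least min_gap wide. Returns half-open
    (start, end) index pairs; leading and trailing zeros are trimmed."""
    spans = []
    start = None  # first index of the current content span
    last = None   # index of the most recent nonzero entry
    for i, v in enumerate(profile):
        if v == 0:
            continue
        if start is None:
            start = i
        elif i - last - 1 >= min_gap:
            spans.append((start, last + 1))
            start = i
        last = i
    if start is not None:
        spans.append((start, last + 1))
    return spans


def xy_cut(image, box, min_gap):
    """Recursively cut box = (top, bottom, left, right), half-open bounds,
    and return the leaf blocks in the same format."""
    top, bottom, left, right = box
    row_profile = [sum(image[r][left:right]) for r in range(top, bottom)]
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    row_spans = split_spans(row_profile, min_gap)
    col_spans = split_spans(col_profile, min_gap)
    if not row_spans:  # the region contains no black pixels
        return []
    if len(row_spans) > 1:
        parts = [(top + a, top + b, left, right) for a, b in row_spans]
    elif len(col_spans) > 1:
        parts = [(top, bottom, left + a, left + b) for a, b in col_spans]
    else:
        # No valley wide enough in either direction: emit the trimmed block.
        (a, b), (c, d) = row_spans[0], col_spans[0]
        return [(top + a, top + b, left + c, left + d)]
    blocks = []
    for part in parts:
        blocks.extend(xy_cut(image, part, min_gap))
    return blocks
```

On a skewed page the white gaps between blocks no longer line up with the rows and columns, so the profiles contain no empty valleys and the whole page is returned as a single block.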

The approaches described thus far are top-down: certain global operations are
performed on the entire image. In contrast, Doster [Dos84] proposed a bottom-up
approach which first determines the individual connected components. In the case of
text blocks, the characters are merged into words, words into lines, lines into
paragraphs, and paragraphs into even larger blocks, if such a merging is possible.
However, this approach requires extensive memory resources and is very slow
[Sri86].


2.2.2  Classification  Rule  for  the  Segmented  Blocks 

Scherl et al. [Sch80] described a simple method for obtaining characteristic
features for text, graphics and picture segments. It subdivides the document into
small, overlapping windows. Within each window, a grey
level  histogram  is  evaluated.  Then  the  features  can  be  extracted  statistically  from  the 
histogram.  Text  consists  of  white  background  with  black  characters  on  it.  The 
background  is  almost  entirely  of  one  intensity  level  and  is  the  largest  and  brightest 
part within a window. The characters do not consist of black lines but of many
transitions between black and white. Because of this, a small sharp peak at a bright
greylevel and many darker greylevels are typical of text. Meanwhile, the histogram
of  a picture  has  no  similar  sharp  characteristics.  The  shape  of  such  a histogram 
strongly  depends  on  the  content  of  the  picture.  In  some  cases,  it  might  be  possible 
that  the  histogram  of  a picture  looks  like  the  histogram  of  text.  But  usually  the 
percentage  of  darker  levels  within  a picture  is  higher  and  often  the  brightest  points 
within  pictures  are  darker  than  the  background  of  text.  Furthermore,  if  a graphic 
consists  only  of  lines,  its  greylevel  histogram  will  not  differ  much  from  that  of  text. 
Therefore, statistical features taken from a greylevel histogram do not seem
suitable for discriminating text from graphics. The shape of the histogram is largely
dependent upon the size of the window. A larger window results in a weaker
dependence of the shape of the histogram on the position of the window within the
text. However, larger windows decrease the accuracy.

A method for block classification was proposed by Wong et al. [Won82]. They
use block height and block mean black pixel run length as the basic features. Several
measurements are taken to classify text line blocks and graphics or halftone picture
blocks: the total number of black pixels in the segmented image block (BC), the
minimum x-y coordinates of a block and its x-y lengths (xmin, Δx, ymin, Δy), the total
number of black pixels in the original image for the block (DC), and the number of
horizontal white-black transitions in the original image block (TC). The following
features are measured during component labeling: (1) the height of each block
segment, H = Δy, (2) the eccentricity of the rectangle surrounding the block, E = Δx/Δy,
(3) the ratio of black area to enclosing box area, S = BC/(ΔxΔy), and (4) the mean
horizontal length of the black runs of the original data from each block, Rm = DC/TC.
Some of the features are illustrated in Figure 2-4.

Figure 2-4. The typical block shape of text line in the RLSA: (a) the shape of a text
block; (b) an example of the shape of a nontext block.

A block is considered to be text if its R and H values are less than some
constant multiples of the mean length of black run and the mean height. In other words,
a pattern classification scheme that assumes linear separability is used to determine
the region in the plane with mean height (Hm) as one coordinate and mean length
of black run (Rm) as the other. The distribution of values in the R-H
plane obtained from sample documents is observed to determine the discriminant
function. For example, text is the predominant data type in a typical office document,
and text lines are basically textured stripes of approximately constant height H and
mean length of black run Rm. Text blocks tend to cluster with respect to these
features. Figure 2-5 illustrates the distribution of values in the Rm-H plane. The text
lines of a document form a clustered population within the range 20 < H < 35 and
2 < Rm < 8. Low Rm and H values represent the regions that contain text. The
graphic and halftone images have high values of H, whereas solid black lines have
high R and low H values in the Rm-H plane.


Figure  2-5.  The  distribution  of  Rm  and  H values  for  each  block  type. 

The  Hm  and  Rm  for  the  text  cluster  may  vary  for  different  types  of  documents, 
depending  on  character  size  and  font.  Furthermore,  the  text  cluster’s  standard 
deviations σ(Hm) and σ(Rm) may also vary depending on whether a document is in
a single font or multiple fonts and character sizes. These authors applied heuristic
rules to classify the blocks regardless of the character size and font. A variable, linear,
separable classification scheme is used to assign the blocks to the following four
classes.

    Text:                          if R < C21 × Rm and H < C22 × Hm
    Horizontal solid black lines:  if R > C21 × Rm and H < C22 × Hm
    Graphics and halftone images:  if E > 1/C23 and H > C22 × Hm
    Vertical solid black lines:    if E < 1/C23 and H > C22 × Hm

The constants Cij are determined heuristically by examining the R-H plane
plot of typical documents and the values of Rm and Hm. They have assigned
values to the parameters based on several training documents. Although prior
knowledge  about  the  structural  characteristics  of  a newspaper  can  be  used  for 
classifying  blocks,  in  some  cases  these  features  will  lead  to  classification  errors.  For 
example,  the  geometric  characteristic  that  a text  line  has  approximately  a given 
constant height could be used for deciding that a block is a text line. But if the image
was skewed while it was being digitized, some text lines would be linked together by the
block segmentation procedure, and the linked text lines would then be classified into
the graphic and halftone categories.
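
The four-class rule above can be sketched as a small decision function. This is an illustrative reading, not code from [Won82]: it assumes that the vertical-line class is the one with E below 1/C23 (so that it differs from the graphics class), it assigns the boundary cases that the strict inequalities leave open to one side, and the name `classify_block` and all numeric values in the usage example are hypothetical.

```python
def classify_block(r, h, e, r_mean, h_mean, c21, c22, c23):
    """Assign a block to one of four classes from its mean black run length r,
    height h and eccentricity e, given the text-cluster means r_mean, h_mean
    and the heuristic constants c21, c22, c23."""
    if h < c22 * h_mean:
        # Low blocks: short black runs indicate text, long runs a solid line.
        return "text" if r < c21 * r_mean else "horizontal line"
    # Tall blocks: wide ones are graphics or halftone, narrow ones vertical lines.
    # (Boundary cases with equality fall into the second branch; this is a choice
    # made for the sketch, since the rule itself uses strict inequalities.)
    return "graphics/halftone" if e > 1.0 / c23 else "vertical line"
```

For example, with assumed constants c21 = c22 = 3 and c23 = 5 and cluster means Rm = 5 and Hm = 27, a block with R = 4 and H = 25 is labelled text, while two text lines merged by skew into a block of height 90 are labelled graphics/halftone, which is the failure described above.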

In Wahl’s work [Wah83], a distance-mapping function based on a border-to-border
distance of the block to be classified is used for shape discrimination. The
new distance d(x,y,φ) is a function of the two Cartesian coordinates x, y, with
(x,y) ∈ S, and an angle φ measured from the x axis, with 0° ≤ φ < 180°. It is defined
as the length of the line segment B1B2 with angle φ which connects an inner point
p(x,y) of S with two opposite border points B1, B2 of S, such that the line segment
B1B2 is entirely inside S (Figure 2-6). The minimum and the maximum line segment
lengths at any point (x,y) over all possible angles φ are defined as Dmin(x,y) and
Dmax(x,y), respectively. In addition, an eccentricity mapping Decc(x,y) is defined as

$$D_{ecc}(x,y) = \frac{D_{max}(x,y)}{D_{min}(x,y)} = \frac{\max_{\phi} d(x,y,\phi)}{\min_{\phi} d(x,y,\phi)}$$

Similarly, dmin, dmax, and decc
are defined as the average values of Dmin(x,y), Dmax(x,y), and Decc(x,y) over all the
pixels in the connected component.

Simple shape factors f1 and f2, which are derived from dmin and dmax as

$$f_1 = c_1 \cdot A / d_{min}^2, \qquad f_2 = c_2 \cdot A / d_{max}^2,$$

can be used as features to discriminate text, graphics, and thresholded gray-level
pictures. In the two shape factors f1 and f2, A is the number of pixels in discrete
space and c1, c2 are constants determined by experimentation. Text has a large f1
value, and graphics have a relatively large f1 value too. However, text has a large
f2 value compared to graphics and thresholded gray-level pictures, while thresholded
gray-level pictures have a small f1 compared to the other two types of images.

Figure 2-6. Illustration of the distance mapping function d(x,y,φ).

Wang et al. [Wan89] adopt statistical texture analysis for discriminating
document image categories. Their statistical approach to texture analysis has two
basic stages: (1) a series of intermediate matrices is computed from the
image region, and (2) a set of features is computed from these intermediate
matrices. For the intermediate matrices, they use the BW matrix and the BWB matrix.
The BW matrix is built from black-white pair runs, each of which is a set of
consecutive black pixels followed by a set of consecutive white pixels; the BWB matrix
is built from black-white-black combination runs, in which the white pixels are in
turn followed by a set of consecutive black pixels. These matrices are illustrated
in Figure 2-7. The length of a run is the number of pixels in the run. A black-white
pair run is categorized into nine classes (i.e., 1 to 9) depending upon the proportion
of white pixels: the category number represents the percentage of the white part in
the black-white pair run. The matrix element p(i,j) specifies the number of times that
the image contains a black-white pair run of length j in which 10 × i percent of the
pixels are white. For example, the matrix element p(3,10) counts the black-white pair
runs whose length is 10 and whose percentage of white pixels is 30.

Meanwhile, a black-white-black combination run is defined as a pixel
sequence in which two black pixel runs are separated by a white pixel run, as shown
in Figure 2-7 (b). The length of a black pixel run is fixed and assigned into three
categories depending upon the predefined arrangement. The matrix element p(i,j) is
the number of times that the image contains a black-white-black combination run,
in the horizontal direction, with white pixel run length j and black pixel runs with
length lying in category i.

Figure 2-7. Definition of BW and BWB matrices.

In order to create a three-dimensional feature space that can distinguish all
the blocks in a document image, these authors also derive two features, F1 and F2, from
the BW matrix, and a feature F3 from the BWB matrix. These features are defined as
follows.

(1) Short Run Emphasis

$$F_1 = \frac{\sum_{i=1}^{N_c} \sum_{j=1}^{N_r} p(i,j) / j^2}{\sum_{i=1}^{N_c} \sum_{j=1}^{N_r} p(i,j)} \tag{2.13}$$

In short run emphasis, the matrix element p(i,j) is the (i,j)-th entry in the given run
length matrix, Nc is the number of different kinds of pixel runs, and Nr is the number
of different run lengths that occur. The value of F1 for small letters is larger than the
value of F1 for large letters, because white spaces between strokes in small letters are
smaller than those in large letters. Meanwhile, the second feature is defined as
follows:

(2) Long Run Emphasis

$$F_2 = \frac{\sum_{i=1}^{N_c} \sum_{j=1}^{N_r} j^2 \, p(i,j)}{\sum_{i=1}^{N_c} \sum_{j=1}^{N_r} p(i,j)} \tag{2.14}$$

The long run emphasis, on the contrary, has a larger value for large letter blocks
than for small letter blocks.

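A third feature, derived from the BWB matrix, is extra long run emphasis.

(3) Extra Long Run Emphasis

$$F_3 = \frac{\sum_{j=T_1}^{N_r} j^2 \left( \sum_{i=1}^{N_c} p'(i,j) \right)}{\sum_{j=T_1}^{N_r} \sum_{i=1}^{N_c} p'(i,j)} \tag{2.15}$$

where

$$p'(i,j) = \begin{cases} p(i,j) & \text{if } p(i,j) > T_2 \\ 0 & \text{if } p(i,j) \le T_2 \end{cases} \tag{2.16}$$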


In extra long run emphasis, the threshold T1 is set to delete short run lengths because
only very long run lengths are needed to express the characteristics of graphics
blocks. The threshold T2 removes the effect of small values of p(i,j), because a
long run appears occasionally in letter blocks and photograph blocks. The thresholds T1
and T2 are determined by experimentation. The three features were measured for
several different types of sample blocks. In the F1-F2 feature space, the blocks
of each type of document image are clustered together within each class and
are well separated between classes, except for graphics blocks. The feature F3,
however, separates graphics blocks from the other types of blocks.
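
The three features can be computed directly from a run matrix, as in the sketch below. It assumes the matrix is stored as a list of rows, where `p[i-1][j-1]` holds the count p(i,j) for run category i and run length j; this storage layout, the function names, and the convention of returning 0.0 for an empty matrix are choices made for the illustration and are not taken from [Wan89].

```python
def short_run_emphasis(p):
    """F1: sum of p(i,j) / j^2 divided by the sum of p(i,j)."""
    total = sum(sum(row) for row in p)
    if total == 0:
        return 0.0
    weighted = sum(v / (j * j) for row in p for j, v in enumerate(row, start=1))
    return weighted / total


def long_run_emphasis(p):
    """F2: sum of j^2 * p(i,j) divided by the sum of p(i,j)."""
    total = sum(sum(row) for row in p)
    if total == 0:
        return 0.0
    weighted = sum(j * j * v for row in p for j, v in enumerate(row, start=1))
    return weighted / total


def extra_long_run_emphasis(p, t1, t2):
    """F3: long run emphasis restricted to run lengths j >= t1, after entries
    with p(i,j) <= t2 have been set to zero."""
    kept = [[v if v > t2 else 0 for v in row] for row in p]
    num = sum(j * j * v for row in kept
              for j, v in enumerate(row, start=1) if j >= t1)
    den = sum(v for row in kept
              for j, v in enumerate(row, start=1) if j >= t1)
    return num / den if den else 0.0
```

A matrix dominated by short runs gives F1 close to 1 and a small F2, while a few long runs raise F2 sharply because of the j² weighting; this is the behaviour the text attributes to small and large letters respectively.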

2.2.3 Line-to-Point Transformation

The transformation of a line in Cartesian coordinate space to a point in polar
coordinate space was developed by Hough. A straight line is described in Figure 2-8
(a) as ρ = x cos θ + y sin θ, where ρ is the normal distance of the line from the origin
and θ is the angle of the normal with respect to the x axis. The Hough transform of
the line is simply a point with coordinates (ρ,θ) in the polar domain (Figure 2-8 (b)).
A family of lines passing through a common point (Figure 2-8 (c)) maps into a
connected set in the polar domain (Figure 2-8 (d)).

The connected set for a family of lines passing through point A in Figure 2-8
(e) is drawn at the top of Figure 2-8 (f), the connected set for point B in the
middle, and that for point C at the bottom. These connected sets meet at (ρ0,θ0)
in Figure 2-8 (f). This occurs because the three points in Cartesian coordinate space
are collinear: (ρ0,θ0) is the parameter pair of the line that passes through all of them.

2.2.4  Recognizing  Textual  Blocks  Using  the  Hough  Transform 

The Hough transform has found numerous applications, such as detecting lines
and curves in pictures and handling multi-valued images. Rastogi et al. [Ras86] applied
the Hough transform to document analysis. The Hough transform technique detects
the presence of a parametrically representable group of points in an image, such as
a straight line or a circle, through a mapping to a parameter space. They utilized the
fact that pages consist of straight components; e.g., text lines are straight. If we use
the Hough transform for document analysis, we can extract text lines because
characters on a text line are usually collinear.

Figure 2-8. The Hough transform.

The Hough transform is applied to the centroid of each connected component
in the Cartesian coordinate space in order to obtain a three-dimensional view in the
polar domain. The accumulator for each point in the polar domain counts the
connected sets that pass through it, each connected set being the transform of the
centroid of one connected component in the document image. An array of accumulators
is set up by quantizing the values of ρ and θ. For a 512 by 512 image, the possible
range of ρ is -362 to 362. This value is arrived at by assuming the origin of the (ρ,θ)
space to be at the center of the 512 by 512 image; hence the maximum value of ρ is
256√2 ≈ 362, and the range of θ is 0° to 180°. Code for the algorithm is as follows:

    For all centroids (x, y) of components
        For θ = 0 to 180 degrees
        {
            ρ = x cos θ + y sin θ
            accumulator[ρ,θ] = accumulator[ρ,θ] + 1
        }

The above states that the Hough transform is applied to every significant point at
row x and column y. For a 512 by 512 image, the maximum value of the normal
distance ρ from the center of the image is 362, reached where x is 256 and y is
256, i.e., ρ = 256 cos 45° + 256 sin 45°.
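
The accumulation loop above can be written as a short runnable sketch. Several details are assumptions made for the example rather than part of the described system: ρ is quantized by rounding to the nearest integer, θ is stepped in whole degrees over 0° to 179° (180° describes the same lines as 0°), the accumulator is a sparse `Counter` instead of a fixed array, and the name `hough_accumulate` is hypothetical. Coordinates are taken relative to the image center, as in the text.

```python
import math
from collections import Counter


def hough_accumulate(points, theta_step=1):
    """Vote in quantized (rho, theta) space for every input point.

    points: iterable of (x, y) centroids, measured from the image center.
    Returns a Counter mapping (rho, theta_in_degrees) to the number of votes.
    """
    acc = Counter()
    for x, y in points:
        for theta in range(0, 180, theta_step):
            t = math.radians(theta)
            rho = round(x * math.cos(t) + y * math.sin(t))
            acc[(rho, theta)] += 1
    return acc
```

Centroids lying on one text line all vote for the same (ρ,θ) cell, so that cell collects as many votes as there are centroids on the line. Neighbouring cells can collect equally many votes when the points are close together, so peak detection in practice needs some form of smoothing or thresholding, which this sketch leaves out.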

The Hough transform for document analysis converts 2-D page images to 3-D
images. The valleys of the 3-D images transformed from 2-D page images separate the
blocks. This mapping is one-to-many in either direction, and among the various
properties which hold true for this transformation are the following: (1) a point in
the document image corresponds to a sinusoidal curve in the parameter plane, (2) a
point in the parameter plane corresponds to a straight line in the document image,
(3) points lying on the same straight line in the document image correspond to curves
through a common point in the parameter plane, and (4) points lying on the same curve
in the parameter plane correspond to lines through the same point in the document
image.

Fletcher  et  al.  [Fle88]  described  an  algorithm  robust  to  changes  in  text  font 
style  and  size  within  an  image.  The  algorithm  uses  simple  heuristics  based  on  the 
characteristics  of  text  strings.  This  segmentation  algorithm  is  based  on  grouping 
collinear  connected  components  of  similar  size  and  does  not  recognize  individual 
characters.  In  order  to  accomplish  these  tasks,  the  algorithm  consists  of  five  steps:  (1) 
connected  component  generation,  (2)  area/ratio  filter,  (3)  collinear  component 
grouping,  (4)  logical  grouping  of  strings  into  words  and  phrases,  and  (5)  text  string 
separation. In the analysis of block structure for the 3-D document image, blocks are
easily accessed by looking at rectangular chunks cut across by the true orientation
lines and their perpendiculars. The authors analyzed the number of rows and the width
of the transitions to classify the blocks. They also use several properties such as the
size of the rectangles, their eccentricity, orientation, texture and complexity.

2.2.5 Conclusions from Previous Work

As discussed earlier, the document analysis system consists of two activities:
block segmentation and block classification. Most early work tried to
separate the page image from the bit-map of the page by a direct method [Won82,
Nag86]. The RLSA generates every text line as a block; in other words, it produces many
blocks with almost the same height as the text lines. Thus, the RLSA generates too many
blocks. It has another shortcoming as well, called the skew problem [Bai87]. If
text lines are scanned with a skew of even a few degrees, so that vertically some
part of the highest letter in the next text line is at the same height as or above the
lowest pixel of the text line being scanned, then the RLSA cannot generate a block
with the height of the text line.

An  approach,  based  on  the  observation  that  documents  generally  have 
rectangular  block  structures,  used  a projection  profile  to  segment  the  block  [Nag86]. 
The  projection  of  the  black  pixel  counts  along  the  horizontal  and  vertical  directions 
was  used  for  block  segmentation  [Zen85,  Mas85].  The  projection  method  is  very 
sensitive  to  document  skew  with  respect  to  the  raster-scanning  direction  of  the 
scanner.  It  produces  satisfactory  results  only  for  documents  with  rectangular  block 
structure.  To  overcome  this  problem,  skew  detection  should  be  done  by  iteratively 
examining  small  angle  deviations  from  the  normal  direction  to  determine  which  angle 
gives  the  steepest  variation  of  the  projection  profile  [Mas85]. 

Rastogi  and  Srihari  [Ras86]  applied  the  Hough  Transform  for  document 
analysis in order to solve the skew problem. This method is invariant to skew;
however, it requires much computation time for the preliminary steps and extensive
memory resources. It is also highly CPU intensive and consequently too
slow to be applied to document analysis without the support of special hardware.

Most  of  the  previous  work  for  block  classification  was  developed  under  the 
assumption  that  documents  were  digitized  without  any  skew.  The  classification 
scheme  in  [Won82]  uses  block  height  and  block  mean  black  pixel  run  length,  and  it 
requires  that  block  height  should  not  be  taller  than  the  height  of  the  highest  letter 
in  that  text  line.  However,  if  the  page  is  scanned  with  even  a few  degrees  of  skew, 
some  text  lines  will  be  stuck  together.  Thus,  the  height  of  stuck  text  lines  leads  to  the 
failure  of  the  block  classification  rule. 

Wang  and  Srihari  [Wan89]  considered  that  an  image  region  possessed  a 
certain  texture  if  it  had  some  basic  subpatterns  which  occur  repeatedly  according  to 
some specific rules of arrangement. Two matrices, the BW matrix and the BWB matrix, are
used to represent the textural characteristics of a newspaper image block. In addition,
three features, (1) short run emphasis (SRE), (2) long run
emphasis (LRE), and (3) extra long run emphasis (ELRE), were measured for several
sample blocks segmented from the newspaper image. From these three features,
the two-dimensional SRE-LRE space and the three-dimensional feature space
created by SRE, LRE, and ELRE are used to distinguish between the different types
of blocks. Although prior knowledge about the structural characteristics of a
newspaper can be used for classifying blocks, in some cases these features will lead
to classification errors. For example, the geometric characteristic that a text line has
approximately a given constant height could be used for deciding that a block should
be a text line. But if the image was skewed while it was being digitized, some text
lines would be linked together by the block segmentation procedure, and the linked
text lines might then be classified into the graphic and halftone categories.


CHAPTER  3 

MACHINE  UNDERSTANDING  OF  STRUCTURES  IN  DOCUMENTS 

3.1  Introduction 

Automatic  block  segmentation  and  classification  of  a digitized  document 
image  are  necessary  elements  of  a document  analysis  system  capable  of 
understanding  a document  consisting  of  a mixture  of  text  and  graphics  images.  Such 
block segmentation can be done by element, by text line, or by relatively big paragraphs
separated by wide white space. Block segmentation by text line has been
described using the run length smoothing algorithm [Wah82]. The method of recursive X-Y
cuts [Nag86] generates bigger blocks, obtained from the projection profile.
However, both algorithms require that documents be placed without skew. The
Hough transform has been used to design a system which is highly insensitive to
document skew and separates text strings from mixed text/graphics images [Fle88].
However, this robust algorithm is so CPU intensive that it may require special
purpose hardware for acceptable response times.

The  failure  of  block  segmentation  due  to  skewed  document  image  data  not 
only  separates  the  document  inappropriately  but  also  induces  misclassification  of  the 
blocks.  In  this  chapter,  we  will  address  the  development  and  implementation  of  a 
robust  algorithm  for  automatic  separation  and  analysis  of  text,  graphics,  and  halftone 
images. This new algorithm for block segmentation divides the page by the white spaces
which separate the blocks, despite the skew of the document. Then, the segmented
blocks are classified by a classification scheme which is not affected by rotation
of the block.

In  the  understanding  stage,  the  document  analysis  system  not  only  recognizes 
the  text  blocks  but  also  understands  a nontext  area.  Such  a system  requires  advanced 
capability  which  can  analyze  an  object  and  synthesize  the  technologies  developed  by 
various methodologies. The complete understanding system for a two-dimensional
page image with printed characters and graphics is the integration of work done on
several problem areas. This system not only encodes each block using a different type
of data format to reduce the memory size, but understands the documents as well.

3.2  The  Block  Segmentation  of  Digitized  Documents 

In  this  section,  we  will  discuss  the  block  segmentation  which  is  a procedure 
that  subdivides  the  area  of  a digitized  document  into  blocks  in  order  to  process  the 
document  images  systematically.  Each  of  the  blocks  ideally  is  required  to  contain  only 
one  type  of  image  data.  Such  block  segmentation  of  document  image  data  should  be 
done  by  certain  rules.  As  we  described  earlier,  it  can  be  done  by  an  element,  a text 
line,  or  relatively  big  blocks.  The  previous  block  segmentation  approaches,  such  as 
run  length  smoothing  algorithm  and  recursive  X-Y  cuts,  also  called  projection  profile 
cuts, require the document to be placed without skew. The block segmentation
algorithm will be evaluated on three aspects: (1) time complexity, (2)
robustness, and (3) block size.

The document image can be segmented into blocks by two different methods,
top-down and bottom-up. In the top-down approach, certain global operations are
performed on the entire image. In the bottom-up approach, on the other hand, all
the components in the document images are individually detected and then merged
together into larger blocks. We will address a new approach for block segmentation
belonging to the top-down approach. This new approach connects neighboring components
to generate bigger connected components of appropriate size. To generate the
connected components, we apply the operator defined as follows:

Definition 1. Let P(x,y) be a picture element at location x and y in the
document D. Define P(x,y) = 1 as a black picture element and P(x,y) = 0 as a white
picture element.

Definition 2. The operator OP1 executes the following operation. If P(x,y) = 1,
then OP1(P(x,y)) generates picture elements such that P(x+α, y+β) = 1 for every α, β
with |α| ≤ d and |β| ≤ d, where α and β represent the row and column offsets and d is a
predetermined distance.

The operator OP1 with the predetermined distance is applied to every black
pixel of the document image data; it expands each black pixel so as to connect every
black pixel in an area which can be separated from other areas intuitively. The
predetermined distance is set to a value which is larger than half the distance
between text lines and less than half the distance between blocks. Normally, the
space between the blocks is wider than the space between the text lines. Of course,
this is not a rule that all documents have to abide by. However, we will exclude
documents with poor layout structure that violate the above condition. Table 3-1
illustrates the space distance between the blocks for several documents.


Table 3-1. Illustration for the space distance between blocks.

    Block Pairs                    Distance
    -----------------------------  --------
    Text : Text                    4-9
    Text (Bold face) : Text        4-10
    Text : Text (Bold face)        5-10
    Text : Nontext                 2-17
    Nontext : Text                 3-28
    Caption (Nontext) : Nontext    4-7
    Nontext : Caption (Nontext)    3-14

    < Unit = distance between text lines >
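
The operator of Definition 2 can be sketched as follows. The sketch reads the definition as expanding every black pixel into a square of side 2d + 1 centered on it; that reading, the clipping of the square at the image border, the list-of-lists bit map, and the name `apply_op1` are assumptions made for this illustration.

```python
def apply_op1(image, d):
    """Return a new bit map in which P(x+a, y+b) = 1 for every black pixel
    P(x, y) of the input and every offset with |a| <= d and |b| <= d."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            if image[x][y] != 1:
                continue
            # Blacken the (2d+1) x (2d+1) square around (x, y), clipped to the image.
            for a in range(max(0, x - d), min(rows, x + d + 1)):
                for b in range(max(0, y - d), min(cols, y + d + 1)):
                    out[a][b] = 1
    return out
```

Two black pixels separated by a gap of at most 2d white pixels end up in one connected component, whereas a wider gap survives. This is why d is chosen larger than half the distance between text lines but smaller than half the distance between blocks.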


3.3  Labelling  of  Segmented  Blocks 

To identify each block separated by the block segmentation process, labels
have to be assigned to the different blocks for subsequent procedures such as
block classification and feature extraction. This labelling process treats the
individual connected components of a set S as separate objects. S is a bit map
representation of the scanned document page. Each element of S has a value of
1 for a black pixel and a value of 0 for a white pixel.

We can label the components of S by performing a row-by-row scan. As the
first line is scanned from left to right, the label 1 is assigned to the first
black pixel. This label is propagated: subsequent adjacent black pixels are
labeled with 1's until the first white pixel is encountered. The next black
pixel along the line is labeled with 2, and so are its adjacent black
neighbors. This is continued until the end of the first line is reached. For
each black pixel P in the second and later lines, the neighborhood in the
previously labeled line is examined along with the left neighbor of the pixel;
that is, the left neighbor, the upper neighbor, and the two upper diagonal
neighbors, all of which have already been visited by the scan. If all of these
neighbors are 0, the current pixel P gets a new label; that is, if a black
pixel has no labeled neighbor, the next label not yet assigned is assigned to
it. If one of them is 1, pixel P gets the same label; if two or more of them
are 1, pixel P gets one of their labels and the equivalences are noted. This
labelling process is illustrated in Figure 3-1 (a). The procedure is continued
until the bottom line of the binary image is reached. Because of the
equivalences, some adjacent black regions may be labeled differently. The
equivalent pairs are sorted into equivalence classes, and a label is picked to
represent each class (see Figure 3-1 (b)).
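
The row-by-row labelling described above can be sketched as a two-pass
procedure. The sketch below is a minimal illustration, not the code of Figure
3-1: it assumes a list-of-rows 0/1 image and records the equivalences in a
union-find table, which is one common way (an assumption here) to sort
equivalent pairs into classes. The function name `label_components` is
hypothetical.

```python
def label_components(image):
    """Two-pass labelling of 8-connected black regions.

    image: list of rows of 0/1 values. Returns a same-sized grid in which
    every black pixel carries the representative label of its region.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] is the union-find parent of label k; 0 is unused

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for x in range(rows):
        for y in range(cols):
            if image[x][y] != 1:
                continue
            # Neighbours already visited by the scan: left and the three above.
            seen = []
            for dx, dy in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                i, j = x + dx, y + dy
                if 0 <= i < rows and 0 <= j < cols and labels[i][j]:
                    seen.append(find(labels[i][j]))
            if not seen:
                parent.append(len(parent))  # next label not yet assigned
                labels[x][y] = len(parent) - 1
            else:
                smallest = min(seen)
                labels[x][y] = smallest
                for k in seen:  # note the equivalences
                    parent[k] = smallest

    # Second pass: replace each label by the representative of its class.
    for x in range(rows):
        for y in range(cols):
            if labels[x][y]:
                labels[x][y] = find(labels[x][y])
    return labels


u_shape = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
labelled = label_components(u_shape)
```

In the U-shaped example the two vertical strokes first receive labels 1 and 2;
the equivalence is noted when the scan reaches the bottom row, and the second
pass gives every pixel the single representative label.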

3.4  The  Unconstrained  Block  Classification  Rule 
Some of the simplest patterns that people can recognize without difficulty
are very hard for a computer to detect. A human can classify the blocks in a
document instantaneously even when a text block consists of letters that are
unknown to the reader. Without recognizing each character, a human being is
able to classify the blocks of text in a document if he or she is educated
enough to figure out what most characters look like. A few research works
describe the block classification of mixed text/graphics images assuming that
the documents are scanned without skew. Since in most cases the document image
is segmented with skew, the classification rule should be able to classify the
blocks regardless of their shape, because the shape of the blocks changes when
the document is scanned with skew.

The features of each type of block need to be scrutinized in order to generate
the classification rule. In the human visual mechanism, seeing is known to
involve processing an enormous amount of data. Part of the shock of making a
deep analysis of the vision process comes from the realization of how much
information the human brain processes in the act of seeing. The brain keeps a
temporary record of the sensory input during perception [Sow84]. A visual icon
is stored in the brain for just a fraction of a second. When the brain receives
a new sensory icon, it must search its stock of percepts to find ones that
match parts of the icon. The cerebral cortex stores the percepts, but other
parts of the brain may control the actual searching and comparing. The brain
also has an associative mechanism which retrieves the pattern that matches
best, while an ordinary computer retrieves data by an address in storage.
Perception finds percepts that match the overall pattern of an icon before it
fills in percepts for the detail. It is therefore impossible to simulate the
human visual mechanism completely using current computer technology. Some
features for the block classification will be extracted in a simple way in
order to reduce the enormous computation time of visual processing.

Figure 3-1. The algorithm for component labelling.

3.4.1 The Ratio of the Number of Black Pixels and Black-White Transitions

Several measurements, such as the total number of black pixels in the original
image of the block and the number of horizontal black-white transitions in the
original image of the block, are considered in order to distinguish text blocks
from graphics image blocks. The ratio of the number of black-white transitions
to the total number of black pixels in a block, expressed as a percentage, can
serve as a feature for block classification. Table 3-2 shows the processing
results for documents containing text and graphics images when the documents
are scanned at 240 dots per inch (dpi). The processing results for a document
scanned at 100 dpi are shown in Table 3-3.
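
As a rough illustration of this measurement, the sketch below counts black
pixels and horizontal black-white transitions for one block and returns the
ratio as a percentage (for example, 648 transitions over 2139 black pixels
gives about 30.29). The exact transition-counting convention is not spelled out
in the text, so the one used here is an assumption: a transition is a black
pixel whose right-hand neighbour is white, and positions past the end of a row
count as white. The function name `block_ratio` is hypothetical.

```python
def block_ratio(block):
    """Return (black_pixels, transitions, ratio_percent) for a binary block."""
    black = 0
    transitions = 0
    for row in block:
        for j, pixel in enumerate(row):
            if pixel != 1:
                continue
            black += 1
            # Assumption: a transition is a black pixel whose right-hand
            # neighbour is white; pixels past the row end count as white.
            if j + 1 == len(row) or row[j + 1] == 0:
                transitions += 1
    ratio = 100.0 * transitions / black if black else 0.0
    return black, transitions, ratio


sample_block = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]
result = block_ratio(sample_block)
```

Thin strokes produce many transitions per black pixel, while solid areas
produce few, which is why the ratio separates ordinary text from heavy lines
and blobs.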

From Tables 3-2 and 3-3, the ratio between the total number of black pixels
and black-white transitions shows different ranges of values for different
types of blocks. The ordinary text blocks, and half of the graphics and
halftone images in Table 3-2, have a ratio of around 30% when the documents are
scanned at 240 dpi, while text blocks with bold-faced letters and bigger type
have a smaller ratio than ordinary text blocks. As expected, the ratios are
higher when the documents are scanned at a lower dot resolution.

Table 3-2. Processing results of ratio between the number
of black pixels and black-white transitions.

# of black pixels   B/W Transitions   Ratio   Class
-----------------   ---------------   -----   ------------------
2139                648               30.29   text
3417                1012              29.61   text
3580                1040              29.05   text
4112                1183              28.77   text
1780                528               29.66   text
3544                1038              29.29   text
2911                666               22.88   text
4046                1205              29.78   text
3427                968               28.25   text
4751                735               15.47   text
3406                996               29.24   text
434                 123               28.34   text
770                 226               29.35   text
2469                580               23.49   text
3310                989               29.88   text
4436                1155              26.04   text
2251                758               33.67   text
4167                1256              30.14   text
3800                1183              31.13   text
1142                396               34.68   text
5394                72                1.33    H. black lines
5608                170               3.03    H. black lines
47742               8738              18.30   graphics, halftone
175114              51527             29.42   graphics, halftone
195328              59906             30.67   graphics, halftone
63103               12598             19.96   graphics, halftone

(scanned at 240 dpi)

Table 3-3. Processing results of ratio between the number
of black pixels and black-white transitions.

# of black pixels   B/W Transitions   Ratio   Class
-----------------   ---------------   -----   ------------
517                 182               35.20   text
366                 209               57.10   text
412                 207               50.24   text
375                 227               60.53   text
408                 232               56.86   text
352                 209               59.37   text
383                 230               60.05   text
416                 206               49.52   text
390                 217               55.64   text
449                 232               51.67   text
482                 231               47.93   text
469                 217               46.27   text
471                 217               46.07   text
424                 215               50.71   text
400                 220               55.00   text
454                 231               50.88   text
320                 190               59.38   text
5818                304               5.23    trademark
387                 239               61.76   text
372                 234               62.90   text
466                 286               61.37   text
2136                1036              48.50   text
482                 1219              39.54   line drawing
66                  33                50.00   text
428                 225               52.57   text
454                 221               48.46   text

(scanned at 100 dpi)

3.4.2 Removal of Lines

The removal of line segments relies on being able to detect them. Various
methods for detecting line segments, such as tracking the medial line of
thinned images and finding the longest vector, have been reported in the
literature [Hil69]. However, these bottom-up approaches are considered too time
consuming. Removing line segments as a top-down process can reduce the
processing time. Global operators are applied to each logical 1, which
represents a black pixel, and may result in the removal of the line segments
called pure line segments. A pure line segment is a line segment which does not
represent any other element but only itself. In order to remove the line
segments, we apply the operator defined as follows:

Definition 6. Let the boundary black pixel (BBP[x,y]) be the black pixel at
location x and y with up to 7 black pixel neighbors. The operator $B_1(p)$
eliminates the BBP from the image data.

The removal of line segments is described as two different schemes according
to the length of the line segment. The removal of all the line segments,
including characters, can be accomplished by applying the operator $B_1$, while
the removal of pure line segments can be done as follows. The linear expansion
operator $OP_1$ is applied to each black pixel until nearby line segments
conglomerate. Then the operator $B_1$, which removes the boundary pixels, is
applied to each black pixel until the line segments that have not conglomerated
disappear.
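
The boundary-removal operator of Definition 6 can be sketched as follows. This
is an illustrative reading rather than the original implementation: the image
is a list of rows of 0/1 values, positions outside the image are assumed to be
white, and every pixel is tested against the input image so that one call
corresponds to one parallel application of the operator. The function names are
hypothetical.

```python
def remove_boundary(image):
    """Delete every black pixel that has at most 7 black 8-neighbours."""
    rows = len(image)
    cols = len(image[0]) if rows else 0

    def black_neighbours(x, y):
        count = 0
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx == 0 and dy == 0:
                    continue
                i, j = x + dx, y + dy
                # Assumption: positions outside the image are white.
                if 0 <= i < rows and 0 <= j < cols and image[i][j] == 1:
                    count += 1
        return count

    # Decide from the input image only, so all pixels are tested in parallel.
    return [
        [1 if image[x][y] == 1 and black_neighbours(x, y) == 8 else 0
         for y in range(cols)]
        for x in range(rows)
    ]


def black_count(image):
    """Total number of black pixels in the image."""
    return sum(sum(row) for row in image)


solid = [[1] * 5 for _ in range(5)]          # a 5x5 blob
line = [[0] * 5, [1] * 5, [0] * 5]           # a one-pixel-wide line
once = remove_boundary(solid)
twice = remove_boundary(once)
```

Each pass peels one layer of pixels off every region, so a thin stroke vanishes
after a single pass while a thick blob keeps some black pixels for several
passes.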

3.4.3 The Procedures of Block Classification

Since the document image is digitized with skew in most cases, a
classification rule which is invariant to skew is desirable for the
classification of the blocks. The procedures for block classification of
documents can be informally described as follows.

1. Estimate the total number of black pixels.

2. Count the black-white transitions (B/W TC) of the block.

3. Estimate the ratio between the total number of black pixels and the
black-white transitions (R).

4. Remove all the line segments and measure the total number of black pixels.
If the value is almost the same as the value above, the block is a symbol;
otherwise it is a picture with blob image.

5. When the documents are scanned at 100 dpi (S = 100), both text and some of
the complex line drawings satisfy the following:

$\sum\sum B_1[B_1[D[i,j]]] = 0$ and $20 \times (240/S)^{1/2} < R < 40 \times (240/S)^{1/2}$,
where S is the resolution of the scanned document in dots per inch.

6. If $\sum\sum B_1[B_1[D[i,j]]] \neq 0$ and B/W TC is greater than half the
number of step 2, then estimate the number of black pixels after
$\sum\sum B_1[B_1[B_1[D[i,j]]]]$.

7. If the number of black pixels equals 0, then the block is text with large
characters.

8. Remove the pure line segments from the original image of the block and then
measure the B/W transitions. If the value R is the same as the original value,
the block is text; otherwise it is a line drawing.

9. If $R < 20 \times (240/S)^{1/2}$ or $R > 40 \times (240/S)^{1/2}$, and
$\sum\sum B_1[B_1[D[i,j]]] = 0$, then the block is a line drawing.
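
The resolution-dependent bounds on R used in steps 5 and 9 can be evaluated
with a short helper. The sketch below only computes the two thresholds, read as
20 and 40 multiplied by the square root of 240/S, and checks whether a ratio
lies strictly between them; it is not the complete classification procedure,
and the function names are hypothetical.

```python
import math


def text_ratio_bounds(s):
    """Lower and upper bound on R for a scan resolution of s dots per inch."""
    scale = math.sqrt(240.0 / s)
    return 20.0 * scale, 40.0 * scale


def in_text_range(r, s):
    """True if the ratio r lies strictly between the two bounds."""
    low, high = text_ratio_bounds(s)
    return low < r < high


bounds_240 = text_ratio_bounds(240)   # (20.0, 40.0)
bounds_100 = text_ratio_bounds(100)   # roughly (30.98, 61.97)
```

At 240 dpi the bounds are 20 and 40, which brackets the ratios of about 30
listed for ordinary text in Table 3-2; at 100 dpi both bounds grow by a factor
of about 1.55.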

3.5  Experiments 

3.5.1  Experimental  Images  and  Facilities 

In this dissertation, we have selected page images with text and a few nontext
elements such as a designed trademark and a circuit line drawing. The designed
trademark and the circuit line drawing are clearly separated from the text
blocks by wide enough white space. The methods for block segmentation and
classification are intended to be robust against poor placement of documents on
the scanner. The page images, therefore, are scanned with various skew angles.

Figure 3-2 illustrates the hardware structure of the SUN Workstation based
intelligent text processing system in which the algorithms described in the
previous sections are implemented. The EasyScan image scanning system is used
as a page reader for the document analysis system. This system (Microtex
scanner version) is hooked up to a SPARCstation 1+ SUN Workstation, and it
contains a Microtex grayscale or color scanner, a scanner driver for Sun
SPARCstations, and the EasyScan scanning utility software. DigitalPhoto, used
as the scanning utility, combines advanced tools for scanning with advanced
tools to create, capture, and process grayscale and color images to achieve a
variety of visual effects. The EasyScan image scanning system can scan 8-bit
grayscale images, 24-bit true color images, and 1-bit line art at up to 400 or
600 dpi. The EasyScan software provides intelligent scanning functions to
prescan, adjust brightness and contrast, sharpen, and color correct. The SUN
Workstation monitor can display the image on the screen through image display
software, and the images displayed on the screen can be printed out through a
PostScript laser printer (Apple LaserWriter II).

Figure 3-2. Page image acquisition and processing system.

3.5.2  Experimental  Results 

Two sample pages for automatic block segmentation and block classification
have been selected. The first sample page is composed of text, a trademark, and
a circuit line drawing; the second sample page has several text blocks within a
relatively small area. These sample pages are shown in Figure 3-3. The
segmented blocks from the experiment are illustrated in Figure 3-4. The second
page has been selected with several blocks to show that the algorithm can
segment the blocks despite the skew. It was scanned at several different
angles. Figure 3-5 (a) ~ (d) shows the results of the robust block segmentation
approach. The new approach for block segmentation shows satisfactory results
despite the skewed document images. This approach also generates bigger blocks
than the run length smoothing algorithm (RLSA), whose block size is a single
text line of the digitized document image. The ratio of the total number of
black pixels and black-white transitions for the second page has been measured
for the skewed page images. As expected, the ratio of each block was affected
little by the skew. Table 3-4 shows the experimental results for the ratio at
each skew angle.


Figure 3-3. The sample pages for the experiment (sample page #1 and sample
page #2).

Figure 3-4. The segmented block of the first sample page.

Figure 3-5. The processing result of the second page: (a) without skew,
(b) skewed by 1.2°, (c) skewed by 6.5°, (d) skewed by 23.0°.

Table 3-4. The ratio between the number of black pixels and
black-white transitions for the document skewed at various angles.

Skew angle   Block   # of B Pixels   B/W Trans.   Ratio (%)
----------   -----   -------------   ----------   ---------
0°           # 1     5856            2161         53.98
             # 2     4710            2656         56.39
             # 3     6338            3361         53.03
             # 4     4392            2463         56.08
1.2°         # 1     5667            3132         55.27
             # 2     4497            2637         58.64
             # 3     6050            3367         55.65
             # 4     4222            2475         58.62
6.5°         # 1     5443            3000         55.11
             # 2     4520            2596         57.43
             # 3     6355            3549         55.85
             # 4     4652            2581         55.48
23.0°        # 1     6065            3179         52.42
             # 2     4751            2714         57.12
             # 3     6488            3549         54.70
             # 4     4448            2537         57.04

< scanned at 100 dpi >

3.5.3 Analysis for the Block Segmentation Approaches

The complexity of an algorithm can be measured by either the time required to
execute it on a problem of size n (time complexity) or the memory space
required for its execution (space complexity). The main concern in the analysis
of most algorithms is the time complexity for relatively large values of n.
Like most image processing work, block segmentation of image data in documents
is based on pixel operations, and the number of pixels in a document image is
large. The complexity of previously published segmentation algorithms is mostly
$O(n^2)$. Unfortunately, these bounds are not enough to evaluate those
algorithms, including the approach presented here. In order to evaluate these
algorithms, a detailed time complexity function is required. However, finding
the exact complexity function for a general algorithm is almost impossible.

The time complexity function for the RLSA is $O(n^2)$, and it can be estimated
as follows. Let $p_1$ be the probability of a black pixel in the document,
$p_2$ be the probability of black pixels to be merged, $t_r$ be the time
required for reading a pixel, $t_m$ be the time required for merging two black
pixels into a continuous stream of black pixels, and $t_s$ be the time required
for setting a numeric value to a variable. Then the total time required for the
RLSA is estimated as
$(4t_r + 2t_{ck1} + 2p_1(t_{ck2} + p_2 t_m + (1 - p_2)t_s) + t_{AND})n^2$,
where $t_{ck1}$ and $t_{ck2}$ are the times required for checking if-conditions
and $t_{AND}$ is the time required for applying a logical AND operation to each
pixel in the algorithm.

The time complexity function for the RXYC is also $O(n^2)$. However, it is more
dependent upon the image data than the RLSA. Let $p_3$ be the probability of
the necessity to read the next pixel, $r$ be the number of recursions required,
$p_4$ be the probability of the possibility for block segmentation, $t_{seg}$
be the time required for finding a segment of a block, and $t_{pos}$ be the
time required for checking whether segmentation is possible or not. Then the
total time required for the RXYC can be
roughly  estimated  as  2(tck3  + p3tr  + p3tck4  + nt^  + (1  - q)n2  + 2p4tsegn  + 2(tck3  + 
P3tck4  + ntpos  + (1  - ri)ts)(n  - y)(n  - n)  + ...  • This  can  be  rewritten  as  2(1  + r)(tck3 
+ P3tr  + PsW  + ntpos  + (1  - h)ts)n2  + ...  , where  tck3  and  tck4  are  the  times  required 
for checking if-conditions in the algorithm. The algorithm using the Hough
transform is invariant to skew; however, its time complexity function is
$O(n^3)$. This algorithm needs to generate connected components, find the
centroid of each connected component, and then apply the Hough transform to
each centroid. Let $a$ be the number of lines passing through a common point.
Then the roughly estimated time complexity function is
$((t_g + t_{cent} + t_{Hough}\,a)n)n = t_{Hough}\,a'n^3 + (t_g + t_{cent})n^2$,
where $a = a'n$, $t_g$ is the time required for generating connected
components, $t_{cent}$ is the time required to find the centroid, and
$t_{Hough}$ is the time required for transforming lines passing through a point
in Cartesian coordinate space to a point in polar coordinate space.

The time complexity function for the proposed approach is also $O(n^2)$ and is
dependent upon the operation that connects each black pixel. A relatively exact
time complexity can be estimated as $(t_r + t_{ck1} + p_1 t_{op1})n^2$, where
$t_{op1}$ is the time required for execution of the operation which connects
pixels within a certain distance.

The method using the Hough transform is $O(n^3)$, unlike the other previous
approaches. This means that the method using the Hough transform is slower than
any of the other algorithms as long as n is large. Unlike the comparison with
the Hough transform method, it is not easy to compare the processing times of
the other three algorithms directly through the estimated time complexities
above, because they share the same $O(n^2)$ complexity. In order to evaluate
algorithms with the same big O, we have to estimate the exact coefficient of
the highest degree of n. The RLSA is known as a fast algorithm, so we compare
the RLSA and the proposed approach. Roughly estimated, $p_1$ is less than 0.1
in the text area, although it is heavily dependent upon the type of image data
in the document. Assuming that $p_2$ is around 0.5 for most document image
data, the proposed approach can be considered faster than the RLSA as long as
$t_{op1}$ is almost the same as $t_m$. The average-case input may be a good
choice for comparison, but it is sometimes very hard to measure effectively,
and generally it is not clear what an average input is. The worst-case input is
very useful in some cases. However, it is also not easy to find the worst-case
input for all the approaches.
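
To make the comparison of leading coefficients concrete, the following sketch
evaluates the two estimated coefficients of n squared for one set of values.
The probabilities follow the rough figures quoted above (p1 = 0.1, p2 = 0.5);
setting every elementary time to one unit is purely an assumption made for
illustration and is not a measurement from the experiments.

```python
def rlsa_coefficient(p1, p2, t_r, t_ck1, t_ck2, t_m, t_s, t_and):
    """Coefficient of n^2 in the estimated RLSA running time."""
    return (4 * t_r + 2 * t_ck1
            + 2 * p1 * (t_ck2 + p2 * t_m + (1 - p2) * t_s)
            + t_and)


def proposed_coefficient(p1, t_r, t_ck1, t_op1):
    """Coefficient of n^2 in the estimated time of the proposed approach."""
    return t_r + t_ck1 + p1 * t_op1


# Assumption: every elementary operation costs one time unit.
rlsa = rlsa_coefficient(0.1, 0.5, 1, 1, 1, 1, 1, 1)
proposed = proposed_coefficient(0.1, 1, 1, 1)
```

Under these assumed unit costs the RLSA coefficient is about 7.4 and the
coefficient of the proposed approach is about 2.1; different cost assumptions
would change the numbers, so the figures indicate the form of the comparison
rather than a measured speedup.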

3.6  Page-Structure  Analysis 

Page-structure analysis is the process of converting a page representation to
an abstract representation. This process, considered as a higher level of
document understanding, attempts to determine the overall block structure of a
page. The abstract representation is the specification of the words and
diagrams that make up the printed document and of how the pieces of content are
to be fitted together into a whole. The process of converting the abstract
representation into a physical representation of the document is called
formatting. The physical representation may be oriented to a specific output
device. The physical representation of the document is then converted into a
page representation -- a representation in the format expected by a specific
device. Page-structure analysis, in some senses, is the inverse problem of
interpreting format-control commands. It attempts to find or understand the
control commands that could have been used for laying out image documents.

Document  structure  varies  from  one  type  of  document  to  another.  It  is  not 
practical,  nor  easy,  to  develop  a general  system  to  analyze  all  types  of  documents 
automatically.  Each  document-analysis  system  has  to  focus  on  certain  chosen  types 
of  document  that  are  most  often  needed  by  its  application.  There  is  no  generally 
agreed  upon  ideal  model  for  representing  document  structure.  It  should  be  noted  that 
a human  can  recognize  the  structure  of  a page  immediately  after  it  is  displayed.  For 
a document-analysis  system  to  cope  with  general  classes  of  documents,  a provision 
for  interactive  page  structure  identification  by  the  user  will  be  extremely  useful  and 
powerful.  Some  document  structures  will  be  discussed  in  the  following  section. 

3.6.1 Office Document Architecture (ODA) and its Structure

The Office Document Architecture provides a hierarchical and object-oriented
document model. A document is best thought of as a tree, where the structure is
defined by the shape of the tree and the content is stored entirely in the
leaves of the tree. The ODA document is described by a logical structure and a
layout structure. The logical structure divides and subdivides the document
into items that mean something to the human author or reader, while the layout
structure divides and subdivides a visible representation of the document into
rectangular areas. Logical objects represent general items like titles and
paragraphs, and layout objects represent sets of rectangular areas within
pages.

The common item to both structures is clearly the content, which provides the
link between them as shown in Figure 3-6. To illustrate the structures we shall
use a simple technical document divided into Parts and Sections. Initially we
shall assume that each Part has a title followed by one or more Sections, and
that each Section in turn has a subtitle followed by a series of one or more
paragraphs.

Figure 3-6. Logical and layout structure.

The logical structure shows that the fragment consists of the Part title and
the beginning of the first Section, including the subtitle and paragraphs in an
actual layout on pages shown in Figure 3-7. The layout structure shows that
there are four blocks for the first, that is, left page and two blocks for the
second page. Only the leaves, represented as the blocks, in the tree structure
have contents associated with both structures. The content of a leaf for the
logical object frequently corresponds to the content of a block. This gives the
neat one-to-one correspondence between the leaves of the logical and layout
structures shown in Figure 3-6. However, when a paragraph is split into two
portions that are associated with two separate blocks belonging to two
different pages, the one-to-one correspondence between the two structures does
not exist. Alternatively, the content portions belonging to several logical
objects may be run together into a single layout block.

Figure 3-7. An example of actual layout on pages.
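
The relation between the two trees can be illustrated with a small data
structure sketch. It is not part of the ODA standard or of the system described
here: the class names `Node` and `ContentPortion` and the example document are
hypothetical, chosen only to show a logical tree and a layout tree sharing the
same content portions, including a paragraph split across two pages.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentPortion:
    text: str


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    content: List[ContentPortion] = field(default_factory=list)  # leaves only


def leaves(node):
    """Return the leaf nodes of a tree in left-to-right order."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def portions(tree):
    """Return the content portions of a tree in reading order."""
    return [p for leaf in leaves(tree) for p in leaf.content]


# The paragraph is held as two portions because it crosses a page break.
title = ContentPortion("Part title")
subtitle = ContentPortion("Section subtitle")
para_a = ContentPortion("First half of the paragraph")
para_b = ContentPortion("Second half of the paragraph")

logical = Node("Part", [
    Node("Title", content=[title]),
    Node("Section", [
        Node("Subtitle", content=[subtitle]),
        Node("Paragraph", content=[para_a, para_b]),
    ]),
])

layout = Node("PageSet", [
    Node("Page 1", [
        Node("Block 1", content=[title]),
        Node("Block 2", content=[subtitle]),
        Node("Block 3", content=[para_a]),
    ]),
    Node("Page 2", [
        Node("Block 4", content=[para_b]),
    ]),
])
```

Here the logical tree has three leaves and the layout tree has four, so the
leaves do not correspond one to one, yet both trees reference exactly the same
content portions in the same order.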


3.6.2  The  Structures  for  Office  Document  Architecture 


The Office Document Architecture consists of two sets of object class
descriptions: one for logical objects and one for layout objects. These sets of
descriptions define the types and combinations of objects. The qualifiers
concerning the occurrence of a subordinate object are optional, required,
repetitive, or optional and repetitive. A group of subordinate objects is a
sequence, an aggregate, or a choice. One generic logical structure for a Part
in the document is defined as shown in Figure 3-8. Each object is assumed to be
required unless shown otherwise, so this indicates that a Part begins with a
required title, followed optionally by an author's name, followed by one or
more Sections.

Figure 3-8. Generic logical structure.

Each Section begins with a required subtitle and then consists of a mixture of
paragraphs, diagrams and lists. Repetitive and Choice together represent a
series of one or more items occurring in any order. Lists consist of one or
more list items, while diagrams consist of a picture above the caption or a
caption above the picture. A simple specific instance of the page set with a
single continuation page is shown in Figure 3-9.


Figure 3-9. Specific instance of "part page set". [Diagram labels: Title Page; Continuation Page; Continuation Body Frame (Body).]

Several different views of a logical ODA document can be obtained by
altering the generic layout structure and/or the sets of presentation and layout
styles. As a simple instance, deleting the "Body frame" from the "Title page" in Figure
3-9 would cause each Part of the document to be laid out with only the Part title and
author's name on the first page. Because there would be no frame on the first page
with "Body" as its permitted category, the first Section would have to start in a
"Continuation body frame" on a subsequent page.

Altering the attributes that make up the presentation and layout styles can
produce more radical changes. Although these attributes refer to logical objects, they
are held separately from the main logical structure. This leads to a more concise
representation of the document. The layout styles include the important layout object
class and layout category attributes. Significant changes to the positioning and
ordering of items can be made by changing these attributes. The presentation
styles are used to guide the lower-level content layout process and thus affect the
appearance of the content within the blocks. They contain different attributes for
different content architectures. For character content, for example, they include
attributes  affecting  the  font  and  size  of  characters,  the  distance  between  lines  and  the 
indentation  of  the  first  line.  Changing  both  the  generic  layout  structure  and  the  styles 
can  lead  to  significantly  different  views  of  the  same  logical  document.  Page  and 
margin  sizes  can  vary,  single  or  double  column  output  can  be  used,  and  paragraph 
and  font  details  can  be  changed. 


3.7  Summary 

In this dissertation, we mainly discussed the block segmentation and
classification stages of the document analysis system. Document image data are
segmented into relatively large blocks separated by white space. In most previously
published work, skew in the document image data not only causes the document to be
segmented inappropriately but also induces misclassification of the blocks. This
chapter described the development and implementation of a new algorithm for
automated separation and analysis of text, graphics, and halftone images. The new
block segmentation algorithm connects components to divide the document into blocks
separated by the white space of the page layout. In connecting components, the
proposed algorithm applies to each black pixel an operation which generates a
linearly expanded contour of each component. This algorithm has time complexity
of $N^2$, like most previous top-down approaches to block segmentation; however, it
is invariant to skew and is faster than most previous approaches for most inputs,
even though they have the same time complexity of $N^2$.

As the stage immediately following block segmentation, each block should be
classified according to its content, and the block classification rules must not be
restricted to error-free document image data. The proposed method is insensitive to
skew and is far superior to published methods, which are seriously impaired by a skew
of even a few degrees. Unlike previous classification rules, this algorithm uses
characteristics which are insensitive to document skew. Several measurements are
considered in developing a block classification rule. For instance, a text block is
composed of text lines, and each text line consists of a variety of characters. The
basic components of characters are line segments of uniform width. On applying the
proposed pixel operation, the variation of boundary pixels relative to the total black
image pixels can, in every instance, distinguish the line segments from the blob
segments. For the separation of text and complex fine line drawings, the total
number of neighbors for each black pixel is used, since it has a close relationship
with compactness. Features such as the probability of occurrence of black pixels in a
block, the number of black pixels after applying consecutive operations, the
black-white transitions, and the total number of black neighbors for each black pixel
in the original image are used for the new classification.


CHAPTER  4 

RECOGNITION OF THE CLASSIFIED BLOCK

4.1 Introduction

Designing a system that not only recognizes the text blocks but also
understands the nontext blocks requires advanced analysis and synthesis technologies.
These technologies are of two types: technology for text recognition and technology
for nontext understanding. Much work has been done on text recognition, known as
character recognition, since the late 1950s. The results fall into several sub-areas.
Character recognition is broadly classified as on-line or off-line. On-line character
recognition makes use of the order of strokes made by the writer, whereas the
off-line case treats the completed characters written on a sheet or on some other
material. On-line character recognition deals with a one-dimensional representation
of the input, whereas the off-line case involves analysis of a two-dimensional image.
In on-line character recognition, order information is obtained by writing on an
electronic bit pad, which causes the two-dimensional coordinates of successive points
to be stored in order; this eases the recognition problem compared to off-line
character recognition. Only off-line character recognition is described in this
chapter, since our document analysis system deals with paper-based documents.




Designing such a complete system for understanding two-dimensional page
images with printed characters and graphics consists of integrating work from several
problem areas. Selecting an appropriate type of feature is the most important
decision in designing the system. Two types of features, global [Bal82, Per77] and
structural or local [Hua86a, Cox82], are considered as prospective features for
developing character recognition systems.

4.2 Recognition of Text

Text recognition, considered as the aggregate of character recognition, requires
some preliminary processing, such as word segmentation and character segmentation,
before features can be extracted. The block diagram of a typical character reader is
shown in Figure 4-1. Each character is read and digitized by an optical scanner, then
located and segmented under software control of the computer. The resulting matrix
is then fed into a preprocessor for further processing steps. As an early stage in
the recognition of text, word segmentation is performed to separate each word. A word
is usually a combination of characters lying on a straight text line, and the
distance between words in the text line is longer than the distance between
characters in a word; this textual knowledge provides some information for
separating the words. The generation of eight-connected components is used to group
together black pixels which are eight-connected to one another. The eight-connected
pixels belonging to individual characters or graphics are enclosed in circumscribing
rectangles. Each rectangle corresponds to a single connected component. Applying
eight-connectedness automatically segments the characters in a word, with the
exception of "i" and "j" and some marks such as ";". However, this method works only
for alphabetical characters. A global operator which connects the characters within
a certain distance can be used to separate the words.

Figure 4-1. The block diagram of a typical character reader.
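
The grouping of eight-connected black pixels into circumscribing rectangles can be
sketched as follows. This is a minimal illustration using a breadth-first flood fill,
not the implementation used in this work; the function name `connected_components` and
the list-of-rows image representation are assumptions of the sketch.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected groups of black pixels (value 1) and box each one.

    image is a list of rows of 0/1 values. Each box is
    (top, left, bottom, right), inclusive, in order of discovery.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over the eight neighbours.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

A character such as "i", whose dot is detached from its stem, yields two rectangles,
which is why it is an exception to automatic character segmentation by connectedness.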

The segmentation of closely spaced printed characters can be addressed by
combining segmentation with classification by means of an adaptive decision tree
[Cas82]. The pattern array to be resolved is viewed by the classifier through a
window. A supervisory routine controls the window's width and location. The
window is initially set at the full width of the pattern so that, if the array
contains a single character, the classifier can recognize it in one step. Rejection
by the classifier indicates that the full array does not belong to the alphabet.
When the classifier rejects the pattern, the viewing window is narrowed from the
right-hand side and the classifier is applied to the truncated array. This process
is repeated until either the truncated array is successfully recognized, or the
window becomes so narrow that the search is given up. When the search fails, a window
narrowing operation is attempted in both directions. The segmentation terminates
successfully if the residual array after a positive classification is either null or
narrower than any character in the alphabet.
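
The control loop of this window-narrowing search might be sketched as below. The
sketch is a simplification: it narrows the window only from the right-hand side and
omits the retry in both directions. The names `segment`, `classify`, and `min_width`
are hypothetical, and `classify` stands in for the adaptive decision tree classifier
of [Cas82], which is not reproduced here.

```python
def segment(columns, classify, min_width):
    """Split a run of touching characters by narrowing a viewing window.

    columns   : sequence of pixel columns for the whole pattern array
    classify  : hypothetical callable; returns a label, or None on rejection
    min_width : width of the narrowest character in the alphabet
    Returns the list of labels, or None if the search is given up.
    """
    labels = []
    start = 0
    while len(columns) - start >= min_width:
        width = len(columns) - start  # begin with the full remaining width
        label = classify(columns[start:start + width])
        while label is None and width > min_width:
            width -= 1  # narrow the window from the right-hand side
            label = classify(columns[start:start + width])
        if label is None:
            return None  # the window became too narrow: give up
        labels.append(label)
        start += width  # continue with the residual array
    # Success: the residue is null or narrower than any character.
    return labels
```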

In the following section, only the recognition of machine-printed characters
in Latin fonts will be described. Current objectives in the document analysis system
are mainly to read machine-printed characters, although the techniques to be described
are capable of recognizing handwritten and script characters with accuracy and speed.
Recognition of oriental characters such as Chinese, Korean, and Japanese will not
be considered because these characters involve large alphabets.

4.2.1  The  Selection  of  Feature  in  Character  Recognition 

Feature selection is the most important step in making a machine read as a human
does. Two types of features, global and structural or local, are customarily used for
automatic character recognition. Techniques such as (1) template matching or (2)
various mathematical transformations treat the character matrix as a whole and select
global features from it. On the other hand, structural or local features are based on
geometrical and topological properties of the characters. These features include
interesting points and subpieces. Of the approaches with global features, template
matching is a well-known pattern matching process [Bal82]. This technique simply
measures the similarity between the input character and the stored references by
matching points in the frame. A conventional template matcher calculates the
similarity between a pair of vector patterns by counting the picture elements
(pixels) at which the two patterns differ, using Exclusive OR. The Exclusive OR error
is defined as $E = \sum_{x} \sum_{y} A(x,y) \oplus B(x,y)$, where $A(x,y)$ and
$B(x,y)$ represent the picture elements at location $(x,y)$ and $\oplus$ denotes
logical Exclusive OR.
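
A minimal sketch of the unweighted Exclusive OR error and of nearest-template
selection is given below. The function names `xor_error` and `best_match` and the
dictionary of stored references are assumptions made for illustration.

```python
def xor_error(a, b):
    """Count the pixels at which two equal-sized binary patterns differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("patterns must have the same dimensions")
    return sum(pa ^ pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def best_match(pattern, references):
    """Return the label of the stored reference with the smallest error."""
    return min(references, key=lambda label: xor_error(pattern, references[label]))
```

Every differing pixel contributes equally to the error here, wherever it lies in the
frame.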

A major shortcoming of the conventional template matcher above is that it
treats all errors alike regardless of where they occur spatially. In Figure 4-2,
pattern A and pattern B are different characters, while pattern A and pattern B in
Figure 4-3 are the same character. For recognition to succeed, the Exclusive OR error
in Figure 4-3 should be less than the Exclusive OR error in Figure 4-2. However, the
plain Exclusive OR count does not satisfy this condition in the example. To overcome
this drawback, a weighted Exclusive OR error can be utilized. In the example, we used
a 3 x 3 window to obtain the weighted Exclusive OR count. The weighted Exclusive OR
count for the same character is less than that for the different characters. The
remaining drawbacks of template matching are its high dimensionality and its
sensitivity to translation, rotation, and scaling. The high dimensionality of the
character feature vectors in template matching requires large storage and long
computation time.

Several orthogonal transformations have been explored as possible feature
extractors in order to reduce the high dimensionality of template matching. The
Fourier descriptor, one of the transformational approaches, was proposed to reduce
the high dimensionality of template matching and to extract features invariant to
global deformation [Per77]. Other transformations that have been explored include the
Walsh transform [And71] and the Haar and Hadamard transforms [Wen78]. Zahn and Roskies
applied the Fourier series to describe plane closed curves [Zah72]. The basic idea
of Fourier descriptors is that a closed curve can be represented by a periodic
function of a continuous parameter or, alternatively, by the set of Fourier
coefficients of this function. The coefficients in this collection are referred to as
Fourier descriptors. In order to use these descriptors for pattern classification
applications, the curve representation must be normalized with respect to a desired
transformation class.

Figure 4-2. The template matcher for different characters.

Figure 4-3. The template matcher for the same character.

The Fourier descriptors are defined as follows. The function $\theta(l)$ is the
angular direction of the closed curve $\gamma$ at arc length $l$. The function
$\phi(l)$ is the net amount of angular bend between the starting point $l = 0$ and
the point $l$. So $\phi(l)$ is represented as $\phi(l) = \theta(l) - \theta(0)$,
except for possible multiples of $2\pi$, and $\phi(L) = -2\pi$, where $L$ is the
total length of the curve. Because the domain $[0, L]$ of $\phi(l)$ carries absolute
size information, we need to normalize $\phi(l)$ to make it a periodic function on
$[0, 2\pi]$. The function $\phi(l)$ for character H is shown in Figure 4-4(a).
Figure 4-4(b) shows the normalized function $\phi^*(t)$, which we define as
$\phi^*(t) = \phi(Lt/2\pi) + t$.

The normalized function $\phi^*(t)$ is a periodic function, so we can expand
$\phi^*(t)$ in its Fourier series as follows.

$$\phi^*(t) = \mu_0 + \sum_{k=1}^{\infty} A_k \cos(kt - \alpha_k) \qquad (4\text{-}1)$$

Then the set $\{A_k, \alpha_k;\ k = 1, 2, \ldots, \infty\}$ are the Fourier
descriptors for curve $\gamma$. The main drawbacks of global techniques are their
dependence on position alignment and high sensitivity to distortion and style
variation.
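
As a numerical illustration of Equation (4-1), the following sketch estimates
$\mu_0$, $A_k$, and $\alpha_k$ for a closed polygon. It is a sketch under stated
assumptions rather than the procedure of [Zah72]: the curve is a polygon traversed
clockwise in a coordinate system with the y-axis pointing up, the coefficients are
obtained by the midpoint rule on sampled values of $\phi^*(t)$, and the function name
`fourier_descriptors` and its parameters are hypothetical. NumPy is assumed to be
available.

```python
import numpy as np


def fourier_descriptors(vertices, n_harmonics=8, n_samples=4096):
    """Return (mu0, A, alpha) for a closed polygon traversed clockwise.

    vertices: (m, 2) array-like of corners; the curve starts at vertices[0]
    and the last corner is joined back to the first.
    """
    v = np.asarray(vertices, dtype=float)
    edges = np.roll(v, -1, axis=0) - v            # edge i runs v[i] -> v[i+1]
    lengths = np.hypot(edges[:, 0], edges[:, 1])
    total = lengths.sum()                          # L, the curve length
    theta = np.arctan2(edges[:, 1], edges[:, 0])   # direction of each edge

    # Turning angle at each corner, wrapped to [-pi, pi). phi is constant
    # along an edge and equals the bend accumulated since the start.
    turn = (np.diff(theta) + np.pi) % (2 * np.pi) - np.pi
    phi_on_edge = np.concatenate(([0.0], np.cumsum(turn)))

    # Sample the normalized function phi*(t) = phi(L t / 2 pi) + t.
    t = (np.arange(n_samples) + 0.5) * (2 * np.pi / n_samples)
    arc = total * t / (2 * np.pi)
    edge_index = np.searchsorted(np.cumsum(lengths), arc, side="right")
    edge_index = np.minimum(edge_index, len(lengths) - 1)
    phi_star = phi_on_edge[edge_index] + t

    # Fourier coefficients by the midpoint rule:
    # A_k cos(kt - alpha_k) = a_k cos(kt) + b_k sin(kt).
    k = np.arange(1, n_harmonics + 1)[:, None]
    a = 2 * np.mean(phi_star * np.cos(k * t), axis=1)
    b = 2 * np.mean(phi_star * np.sin(k * t), axis=1)
    return phi_star.mean(), np.hypot(a, b), np.arctan2(b, a)
```

For a unit square the only nonzero amplitudes among the first few harmonics occur at
multiples of four, reflecting the fourfold symmetry of the shape.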

Figure 4-4. The Fourier descriptor and normalized function.

4.2.2 Geometrical and Topological Properties in Character Recognition

Geometrical and topological features for character recognition are based on
the extraction of features which describe the interesting geometry or topology of the
character as a drawing. These features may represent global and local properties of
the character. This is by far the most popular technique from the viewpoint of
sensitivity and proficiency of implementation. Sensitivity to the deformation of a
character image is caused by the following factors:

a) font variation -- the use of a different font to represent the same character.

b) rotation -- change in orientation.

c) translation -- movement of the whole character.

d) noise -- which causes disconnected line segments, filled loops, etc.

The proficiency of implementation is evaluated with respect to speed, complexity,
independence, and automatic mask making.

The main advantages of geometrical and topological features are their high
tolerance to font variation and noise compared to other techniques. Geometrical
and topological features, as structural or local features, are applied mainly to the
recognition of handwritten characters, which vary greatly in style. For local
features, letters are treated as abstract letters which represent only the main
aspects. The abstract letters do not include the embellishments which physical
characters have. The partial abstract representation of a letter, termed a skeleton,
contains as its basic elements vertices, the spatial ordering between vertices
(called edges), and the relationships between vertices and edges. The vertex, denoted
by the symbol shown in Figure 4-5, is taken as primitive, whereas edges specify the
spatial ordering which exists between vertices. The relationships between vertices
and edges represent the interesting aspects, which are shown in Figure 4-6.

These interesting points are divided into several categories: endpoint,
forkpoint, crosspoint, and breakpoint. Each interesting point has its own number of
branches; for example, a crosspoint has four branches and a forkpoint has three
branches. The interesting points above partition a character into subpieces. The
start and end points of each subpiece are found and their corresponding
categories are recorded. The orientation and the length of each subpiece are also
recorded, and the length is normalized by the character length after tracing the whole
character.
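
One way to locate such points on a one-pixel-wide skeleton is sketched below. The
sketch assumes that the branch count of a pixel can be taken as the number of
white-to-black transitions met when walking once around its eight neighbours, and
that an endpoint has exactly one branch; both are assumptions of this illustration,
as is the function name `classify_points`. Breakpoints are not handled because their
definition is not given in this section.

```python
def classify_points(skeleton):
    """Label pixels of a one-pixel-wide skeleton by their branch count.

    skeleton is a list of rows of 0/1 values. Returns a dict mapping
    (row, col) to "endpoint", "forkpoint", or "crosspoint".
    """
    # Neighbour offsets in circular order, starting at north.
    ring = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
    names = {1: "endpoint", 3: "forkpoint", 4: "crosspoint"}
    rows, cols = len(skeleton), len(skeleton[0])

    def pixel(r, c):
        # Pixels outside the image are treated as white.
        return skeleton[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    points = {}
    for r in range(rows):
        for c in range(cols):
            if skeleton[r][c] != 1:
                continue
            values = [pixel(r + dr, c + dc) for dr, dc in ring]
            # Count 0 -> 1 transitions around the ring of neighbours.
            branches = sum(
                1 for i in range(8)
                if values[i] == 0 and values[(i + 1) % 8] == 1
            )
            if branches in names:
                points[(r, c)] = names[branches]
    return points
```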


Figure 4-5. Vertex and edge.

Figure 4-6. The relationship between vertices and edges as an interesting aspect.

4.2.3 Word Recognition by Contextual Information

The performance of a character recognition system cannot be based on
single-character recognition alone. Contextual information can increase recognition
accuracy, just as humans rely on contextual information in reading text [Ehr75,
Dos77]. The input of a contextual word recognition system consists of a string of
characters which have been recognized by the character recognizer. The application
of context makes it possible to detect or even to correct recognition errors. In a
contextual word recognition system, the character recognizer module assigns to each
input character 26 numbers giving the confidences that the character has each of the
labels from a to z. The confidences are then transformed to probabilities. If a
character has two or more labels with nonzero probabilities, it is difficult to
determine which label is the correct one. A string of characters delimited by spaces
constitutes a word. Since each character in a word may have a set of alternative
labels, the output of the character recognizer is actually a sequence of sets called
substitution sets, each of which contains the alternatives with nonzero probability
for a particular character. All possible words are obtained by selecting one
character from each of the substitution sets. Only one of the words that can be
formed from the substitution sets is the correct word. The problem in contextual word
recognition is to determine the combination of labels
$\theta_1, \theta_2, \ldots, \theta_n$ that maximizes the a posteriori probability
$p(\theta_1, \theta_2, \ldots, \theta_n \mid X_1 \ldots X_n)$, where the word
$X_1 X_2 \ldots X_n$ of length $n$ is the output of the character recognizer. The
probability that the true label of character $X_j$ is $\theta_j$ is expressed as
$p(\theta_j \mid X_j)$.
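
The enumeration of candidate words from the substitution sets can be sketched as
follows. Two simplifying assumptions are made that are not stated in the text: the
context is supplied as a plain dictionary of admissible words, and the characters are
treated as independent, so that a word is scored by the product of its per-character
probabilities. The function name `best_word` is hypothetical.

```python
from itertools import product
from math import prod


def best_word(substitution_sets, dictionary):
    """Pick the most probable dictionary word from the substitution sets.

    substitution_sets: one dict per character position, mapping each
    alternative label to its nonzero probability.
    dictionary: set of admissible words (an assumption of this sketch).
    Returns (word, score), or (None, 0.0) if no candidate is admissible.
    """
    best, best_score = None, 0.0
    # Form every word by selecting one label from each substitution set.
    for labels in product(*substitution_sets):
        word = "".join(labels)
        if word not in dictionary:
            continue
        # Independence assumption: multiply the per-character probabilities.
        score = prod(s[label] for s, label in zip(substitution_sets, labels))
        if score > best_score:
            best, best_score = word, score
    return best, best_score
```

The number of candidate words grows as the product of the sizes of the substitution
sets, so exhaustive enumeration of this kind is practical only for short words or
small sets.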

4.3 Recognition of Nontext

Unlike text recognition, the understanding of nontext is performed only for a
limited number of nontext items such as prestored labels and symbols. The difficulty
of perfectly understanding graphics images comes from the limitless variety of
graphics images that can be created, in addition to their complexity. Even though
designers try to draw unique labels, there are many similarities between some of
them. Thus, perfect understanding of graphics images has appeared to be impossible;
nevertheless, some efforts have been made toward this understanding. An effort to
understand geometrical configurations by computer has been reported in the literature
[Tou80]. That work introduces a novel algorithmic approach to the automatic
understanding of geometrical configurations by computer. It describes a fundamental
problem in automatic picture understanding as designing a computer system to analyze,
interpret, and describe such geometrical configurations.

In designing a system for understanding graphics images, a conventional
vision system simulates human vision on the assumption that it is able to identify
objects seen previously. Human vision, however, can identify pictures without learned
convention; this has been demonstrated in a study by Hochberg and Brooks [Hoc62].
They raised their son without allowing him to see any pictures, even advertisements,
food containers, or billboards, for the first two years. Nevertheless, the boy, then
two years old, had no difficulty identifying pictures such as simple line drawings
of shoes and other familiar objects, even complicated ones. Thus, human capability
for perception and recognition is not a matter of learned convention. We are going
to design the recognition system based on the conventional design methodology, which
supposes that humans begin with careful observation and then interpret what they
memorize in order to understand. The conventional recognition system requires a
learning stage including tasks such as image processing and picture understanding.
The image processing task is concerned with the determination of nontext type in
addition to the transmission of graphics images. From the viewpoint of preprocessing
for understanding graphics images, nontext is classified into two types. One is the
line type, which is composed of line components and whose skeleton preserves most of
its properties. The other is the blob type, whose properties mostly reside on the
boundary.

4.3.1 Determination of Nontext Type

Of the image processing tasks, thinning is one of the most important
algorithms and is used frequently to simplify image data in order to ease the
extraction of features. Thinning should be applied only to a graphics image whose
features will not be lost by it. In other words, thinning is applied to simplify the
boundary image by reducing it to its skeleton without destroying its geometrical
shape and connectivity. Thus it is necessary to discriminate the type of graphics
image before the thinning process is applied to it. Graphics images should be divided
into those to which thinning is applicable and those to which it is not. The
procedure for deciding the graphics image type can be informally described as
follows.

(1) Apply operator $B_1$ to the image block at every location until the number of
black pixels reaches zero; then record the number of iterations (Count_iteration =
number of iterations).

(2) Count the number of black pixels after applying $B_1$ Count_iteration - 1 times,
set n = 1, and find the weight of the pixel at each location by using 3 x 3 windows.

(3) If the number of black pixels resulting from (2) is greater than a predefined
value and there exist half the predefined number of pixels with weight less than
four, then the graphics image block is line type.

(4) Otherwise, use 3 x 3 windows to estimate the weighted black pixel value after
applying operator $B_1$ Count_iteration - (n + 1) times. Find line components by
checking the number of pixels with weight three. If the count exceeds a certain
value which makes the graphics image look like line type, then classify it as line
type.

(5) If no line component is found by the 3 x 3 windows in step (4), then count the
total number of black pixels. If the total number of black pixels is greater than 2x8
times the number of black pixels after applying $B_1$ Count_iteration - n times,
then it is line type.

(6) Set n = n + 2, then repeat steps (4) and (5) until n reaches Count_iteration - 1.

(7) If n reaches Count_iteration - 1 without satisfying the condition of either (4)
or (5), then it is blob type.

The examples in Figures 4-7 and 4-8 show characters represented as bit-maps. Both
(c) and (d) in Figures 4-7 and 4-8 show the number of black pixels after applying
operation $B_1$ once. If we apply operation $B_1$ twice, the number of black pixels
becomes zero. The numbers of black pixels in (c) and (d) of Figure 4-7 are relatively
large. Thus, these characters are classified as line type by step 3 of the procedure.
Similarly, the characters in Figure 4-8 are classified as line type by step 5 of the
procedure.
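
Step (1) of the procedure, together with the pixel count needed in step (2), can be
illustrated with the sketch below. The operator applied at each iteration is defined
elsewhere in this dissertation and is not reproduced here; purely for illustration,
the sketch substitutes a plain 3 x 3 binary erosion, which is an assumption and may
differ from the actual operator. The function names `erode` and
`iterations_until_empty` are hypothetical.

```python
def erode(image):
    """One pass of a 3x3 binary erosion (a stand-in for the actual operator).

    Pixels outside the block are treated as white.
    """
    rows, cols = len(image), len(image[0])

    def black(r, c):
        return 0 <= r < rows and 0 <= c < cols and image[r][c] == 1

    def survives(r, c):
        # A pixel stays black only if its whole 3x3 neighbourhood is black.
        return all(black(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1))

    return [[1 if survives(r, c) else 0 for c in range(cols)] for r in range(rows)]


def iterations_until_empty(image):
    """Apply the operator until no black pixel remains.

    Returns (count_iteration, history), where history[i] is the number of
    black pixels left after i applications.
    """
    history = [sum(map(sum, image))]
    while history[-1] > 0:
        image = erode(image)
        history.append(sum(map(sum, image)))
    return len(history) - 1, history
```

With this stand-in, `history[-2]` is the number of black pixels remaining after
Count_iteration - 1 applications, the quantity examined in step (2).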

4.3.2 Classification of Line Drawing

The  computer  understanding  of  line  drawings  has  been  explored  for  the 
industrial  applications  such  as  CAD/Design  Automation  techniques  of  electronic 
circuit  diagrams  [Tou87]  and  interpretations  of  mechanical  design.  The  primary 
components  of  most  line  drawings  are  lines.  These  lines  have  a special  meaning  for 
certain  line  drawings  and  not  for  the  other  line  drawings.  The  former  can  be  referred 
to  as  a line  context  sensitive  line  drawing,  while  the  latter  is  referred  to  as  a line 
context  free  line  drawing.  Thus,  prior  to  applying  each  understanding  stage  to  the  line 
drawing,  the  classification  of  line  drawing  as  either  line  context  sensitive  or  line 
context  free  is  required. 

4.3.2.1 The classification algorithm for line context free line drawings

Figure 4-7. The bit-maps of characters composed of line components.

Figure 4-8. The bit-maps of different characters composed of line segments.

The primary feature of line drawings, whether drawn by hand or by CAD and often
seen in pages along with various picture images, is the line. Despite the fact that
line drawings are composed of line components, the line itself is not the feature
used to classify line context free line drawings, which usually do not include
three-dimensional information. The most prominent features for differentiating line
context free line drawings are the components. The components in such a line drawing
are usually connected by lines, for example, the lines in an electronic circuit
diagram. Chemical line drawings, especially those of organic chemistry, include
English characters such as C, O, H, or others. Meanwhile, the most prominent features
in mechanical and architectural line drawings are lines with three-dimensional
information. In a line context free line drawing, the components which exist in the
drawing provide the evidence to classify it. The procedure for extracting evidence
from the line drawing is informally described as follows.

(1) Set the number of iterations to 1 (Count = 1), and estimate the width of the line
(Width), which is represented as a number of pixels in the bit-map.

(2) Apply the operator $OP_1$ to every black pixel of the line drawing to be
classified.

(3) If there exists a white pixel with black neighbors in the directions either (0,4)
or (2,6), then set Count = Count + 1 and repeat step (2).

(4) Otherwise, apply the operator $B_1$ to every black pixel Count + (Width - 2)
times.

(5) Apply the operator $OP_1$ (Width - 2) times to every black pixel.

(6) Extract the components by combining the result of (5) with the original line
drawing image by a logical AND at each pixel location.

4.3.2.2 The attempt for classification of line drawings with line context sensitive lines

The classification of line context sensitive line drawings is directly connected to
3-D object recognition. The understanding of line drawings has been pursued in
various ways in order to recognize 3-D objects. Huffman and Clowes demonstrated
labelling techniques to interpret line drawings [Huf71, Clo71]. In an attempt to
capture quantitative aspects of shape and to handle arbitrary polyhedra, Mackworth
[Mac85] utilized the concept of gradient space. However, this approach could not
guarantee realizability; that is, there may not exist a polyhedral scene
corresponding to a labelling. Other researchers have also studied the interpretation
of line drawings. Nevertheless, there does not exist a rigorous approach to
understanding line drawings, or even to discriminating the line drawing of a 3-D
object from a simple 2-D line drawing. One difficulty in the understanding of line
drawings is that the image data do not contain 3-D information.

Despite this fact, humans can interpret line drawings with little difficulty. As
a very first step toward interpreting line drawings, volumetric primitives will
be used to discriminate line drawings of 3-D objects from simple 2-D line
drawings. Lines in a machine drawing with context-sensitive lines are classified
into object lines and interpretation lines. Most object lines describe the
object's visible contour in a solid thick font; others describe hidden lines,
axes of symmetry, or cross-sectioned planes. Meanwhile, the interpretation lines
provide the object's precise geometric description along with other information
necessary for producing the object, and are classified into dimension lines and
auxiliary lines. These are further classified into manufacturing lines and
logistic lines, such as the title and frame of the drawing, part list, and part
numbers. These lines represent either 2-D or 3-D objects. Thus, dimensioning
plays a vital role in classifying line drawings as the preliminary step in
understanding machine drawings. It is aimed at providing an exact definition of
the 2-D geometry of the projection, whose approximate shape is described
graphically by the object lines.

4.3.3  Recognition  of  Line  Drawings

In  the  understanding  of  graphic  images,  we  will  be  concerned  only  with  line 
drawings.  Images  such  as  pictures  are  not  ordinarily  subjects  for  recognition  in  a 
document  analysis  system.  There  are  many  basic  procedures  for  recognizing  line 
drawings.  Of  these  procedures,  detection  of  lines  will  be  described  because  the  most 
dominant  element  in  the  line  drawing  is  the  line  segment.  Since  images  in  line 
drawings  usually  involve  an  enormous  amount  of  pixel  data,  processing  procedures 
should  be  done  efficiently  and  fast. 

4.3.3.1  Detection  of  Lines

Of the various methods of detecting line segments, the most common is to apply
a thinning algorithm to obtain medial line images and then to track the medial
lines by using eight-connected neighbors within a 3x3 window. This is
illustrated schematically in Figure 4-9. Another algorithm, referred to as the
longest vector-detection method, searches the logical 1 pixels and finds the
longest vector drawable on the line segment. When the longest vector is found,
line tracking is halted to store the coordinates of both ends of the vector.
Then, the pixels that make up the line are flagged to show that the search has
been completed. New tracking resumes from the last point just stored. Thus, line
images can be represented as a set of vectors, with each vectored line
consisting of a pair of point coordinates. Using these line-detection methods, a
straight line is usually represented by a long vector, while a curved line can
be approximated by a set of short vectors, each having gradually changing
directions. Line detection by a thinning algorithm is a time-consuming process,
whereas the longest vector-detection method is reasonably fast when a drawing
consists of long straight lines.

Figure 4-9. Medial line-tracking method.
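
As a rough illustration of the medial line-tracking idea, the following Python
sketch follows a thinned, one-pixel-wide line through its eight-connected
neighbors, flags each visited pixel, and reduces the tracked line to a single
vector given by its two end coordinates. The function name, the list-of-lists
image layout, and the neighbor scan order are assumptions made for illustration;
this is not the implementation used in this work.

```python
# Offsets of the eight neighbors inside a 3x3 window (assumed scan order).
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def track_medial_line(skeleton, start):
    """Follow a thinned line from `start`, flagging pixels as they are visited."""
    rows, cols = len(skeleton), len(skeleton[0])
    visited = {start}          # flagged pixels: the search has covered them
    path = [start]
    r, c = start
    while True:
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and skeleton[nr][nc] == 1 and (nr, nc) not in visited:
                visited.add((nr, nc))
                path.append((nr, nc))
                r, c = nr, nc
                break
        else:
            # No unvisited black neighbor is left: the line has ended.
            return path


# A small thinned image: a diagonal stroke that turns horizontal.
skeleton = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]
path = track_medial_line(skeleton, (0, 0))
vector = (path[0], path[-1])   # the line stored as a pair of end coordinates
```

A curved line would be split into several such vectors; the splitting rule is
left out of this sketch.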

4.3.3.2  Recognition  of  Line  Types

The main objective in line-type recognition is to find the route of each line
type by analyzing the set of short line segments. Extracting subsets of short
line segments can lead to finding broken lines. For this purpose, a two-level
recognition algorithm is provided: one level is for local recognition, the other
for global recognition. In local recognition, one line segment is chosen as the
starting segment. A small search area is generated at the end of the segment,
and the starting point of a new segment is sought within this area; a candidate
segment is checked to see whether it lies in the same direction as the preceding
line segment. For a segment in the same direction, the length of the segment is
checked and accumulated for statistical analysis to determine the line type.
This process continues for as long as the line types match.

When a mismatch occurs, a probable line type for the preceding line segments
is stored. The process begins again at a point after the mismatch. From that
point, new tracking continues until the mismatched line segment is flagged. This
procedure can find all broken lines sequentially, except for obscure points that
may have been flagged during the local recognition process and may still remain
unclassified. The global recognition process is then initiated to check for
conflicts in the results at the starting point of each mismatched line segment.
If the global processing finds a conflict, and the connection or disconnection
of points is required, the local recognition procedure is applied again to these
points until all conflicts are resolved.
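
The local recognition step can be sketched as follows. This is a minimal Python
illustration under stated assumptions: segments are given as pairs of (x, y) end
points, the search area is a circle of radius `search_radius` around the end of
the current segment, and "same direction" means that the direction differs by at
most `angle_tol`. These names and thresholds are hypothetical and are not taken
from the original system.

```python
import math


def direction(seg):
    """Direction of a segment in radians."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1)


def length(seg):
    """Euclidean length of a segment."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def chain_segments(segments, start, search_radius=5.0,
                   angle_tol=math.radians(10)):
    """Chain segments that continue the starting segment in the same direction.

    Returns the indices of the chained segments and their lengths, which can
    then be analyzed statistically to decide on a line type.
    """
    chain = [start]
    used = {start}
    while True:
        current = segments[chain[-1]]
        end = current[1]
        for i, seg in enumerate(segments):
            if i in used:
                continue
            gap = math.hypot(seg[0][0] - end[0], seg[0][1] - end[1])
            turn = abs(direction(seg) - direction(current))
            turn = min(turn, 2 * math.pi - turn)
            if gap <= search_radius and turn <= angle_tol:
                chain.append(i)
                used.add(i)
                break
        else:
            break   # mismatch: no segment continues the current line
    return chain, [length(segments[i]) for i in chain]


# Three collinear dashes followed by an unrelated vertical segment.
segments = [
    ((0, 0), (4, 0)),
    ((6, 0), (10, 0)),
    ((12, 0), (16, 0)),
    ((12, 4), (12, 10)),
]
chain, lengths = chain_segments(segments, start=0)
```

In this example the three dashes are chained and the vertical segment is left
out. Their equal lengths and regular gaps are the kind of statistics that would
suggest a dashed line type.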

4.3.3.3  Logic-Circuit  Diagrams 

In electrical engineering work, logic-circuit diagrams are often encountered.
These diagrams usually consist of three main elements: lines representing
connecting wires, symbols representing logic-circuit components, and characters
and numbers for component names and attributes. The understanding of such
logic-circuit diagrams can be described by the following sequence. First,
characters and numbers, which are small in size compared with the other elements
of the diagram, are separated from the component symbols and wire lines. Next,
contour tracking is executed on the extracted character and number image. A
structure analysis of the resulting set of point coordinates that make up the
contour is then executed to recognize the characters and numbers.

The remaining image, excluding characters and numbers, is thinned in order to
extract feature points such as end points, branch points, cross points, and
corner points. These feature points are analyzed to yield connection
information for the wire lines. Loops can also be extracted from the medial line
data. By analyzing the shape of the loops, component types are recognized in one
of four orientations along the horizontal and vertical directions. The
components and characters are replaced with prescribed symbols and fonts of
standard shape, and the recognized wire lines are straightened to be horizontal
or vertical.
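
One common way to locate end points and branch points on a thinned image is the
crossing number: the number of white-to-black transitions met while walking once
around the eight neighbors of a pixel. The Python sketch below uses that
technique as an assumption; the text above does not state which test the
original system applied, and the function names are hypothetical.

```python
# Eight neighbors in clockwise order, starting from north.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def pixel(img, r, c):
    """Pixel value, treating everything outside the image as white (0)."""
    if 0 <= r < len(img) and 0 <= c < len(img[0]):
        return img[r][c]
    return 0


def crossing_number(img, r, c):
    """Count 0 -> 1 transitions around the eight neighbors of (r, c)."""
    ring = [pixel(img, r + dr, c + dc) for dr, dc in RING]
    return sum(1 for i in range(8) if ring[i] == 0 and ring[(i + 1) % 8] == 1)


def feature_points(img):
    """Return (end points, junction points) of a one-pixel-wide skeleton."""
    ends, junctions = [], []
    for r in range(len(img)):
        for c in range(len(img[0])):
            if img[r][c] != 1:
                continue
            cn = crossing_number(img, r, c)
            if cn == 1:
                ends.append((r, c))
            elif cn >= 3:
                junctions.append((r, c))   # branch (3) or cross (4) point
    return ends, junctions


# A T-shaped skeleton: a horizontal wire with a vertical wire attached.
skeleton = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
ends, junctions = feature_points(skeleton)
```

Here the three free wire ends are reported as end points and the pixel where the
wires meet is reported as a junction. Corner points need a separate test, for
example on the change of tracking direction, which is not shown.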

4.3.3.4  Mechanical  Engineering  Drawings 

Recognition of mechanical-engineering drawings poses somewhat different
problems from recognition of schematic drawings. The object in a mechanical
drawing is usually complex and intrinsically three-dimensional. Industrial
applications of mechanical line drawings require three-dimensional model data as
the output of the drawing-recognition task, so the recognition of mechanical
engineering drawings requires a highly involved, high-performance recognition
process. In particular, the binarization of such drawings calls for
floating-type binarization using a locally adapted threshold in order to obtain
good and reliable images.
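
A locally adapted threshold can be sketched as follows. In this minimal Python
illustration a pixel is marked black when it is darker than the mean of its
surrounding window by more than a fixed offset. The window radius, the offset,
and the use of the local mean are assumptions chosen for the example, not
parameters of the original system.

```python
def adaptive_binarize(gray, radius=1, offset=20):
    """Binarize a gray-level image with a threshold adapted to each window."""
    rows, cols = len(gray), len(gray[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                gray[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            if gray[r][c] < local_mean - offset:
                out[r][c] = 1   # black (line) pixel
    return out


# Left half brightly lit (paper 200, line 100); right half dim (paper 90,
# line 20). The line on the left is brighter than the paper on the right,
# so no single global threshold separates both lines from the paper.
gray = [
    [200, 100, 200, 90, 20, 90],
    [200, 100, 200, 90, 20, 90],
    [200, 100, 200, 90, 20, 90],
]
binary = adaptive_binarize(gray)
```

With these values both vertical lines are recovered, and the paper is left
white on both sides of the illumination change.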

Symbols are discriminated from one another after the line drawings are
vectorized. Some of the symbols in mechanical drawings have no specific meaning
by themselves, but take on meaning when combined with other lines and symbols in
the line drawing. The symbols with meaning include sectioning symbols,
production symbols, symmetry symbols, and dimensioning symbols. All of these
symbols provide very important information, so they must be analyzed,
integrated, and interpreted in connection with the lines representing the
object. It is then necessary to reconstruct the two-dimensional geometry by
rectifying the vector data according to the dimension information derived in the
above process. Then, by combining the views, sections, and details represented
in the line drawing, a three-dimensional model can be reconstructed.

4.3.3.5  Trademarks  and  Symbols 

An effort to recognize trademarks by computer was made using geometrical
feature selection based on the chain code [Tou87a]. Most trademarks, which
symbolize companies or associations, are designed as a pattern that should be
simple yet distinct. Even though designers try to draw a unique trademark, there
are many similarities between some of them. Trademarks are categorized into two
forms: alphanumeric and symbolic. Each trademark can further be classified into
two types, the blob type and the line type. The line types are grouped as no
loop, single loop, and multiple loops, depending upon the number of loops in the
trademark symbol.

The recognition of trademarks is performed separately for the line type and the
blob type. In the understanding of a line-type trademark, the image data are
first reduced by thinning to make them easier to handle. The main function of
the thinning process is to transform a raster image into an image composed of
linearly connected points. The purpose of applying this process prior to
recognition is twofold: to reduce the number of black points by repeatedly
peeling off boundary points until only the centerline remains, and to define
nodes. The important clues for recognition of the line type are geometrical and
topological features. The theoretical ideas for obtaining general features such
as common ridges, touching, fins, and bridges were discussed intensively in an
approach using the chain code [Tou80].
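
For readers unfamiliar with the chain code, the short Python sketch below
encodes a tracked pixel path as a sequence of eight direction codes. The
numbering (0 for east, increasing counter-clockwise, so that 0/4 and 2/6 are
opposite pairs) is the usual Freeman convention and is assumed here; the cited
work may number the directions differently.

```python
# Direction codes: 0 = east, then counter-clockwise in 45-degree steps.
# Rows grow downward, so a step to the north is -1 in the row index.
STEP_TO_CODE = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def chain_code(path):
    """Encode a path of adjacent (row, col) pixels as direction codes."""
    return [
        STEP_TO_CODE[(r2 - r1, c2 - c1)]
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    ]


# One step east, one step north-east, one step north.
codes = chain_code([(2, 0), (2, 1), (1, 2), (0, 2)])
```

Geometrical features such as corners can then be read from changes between
successive codes.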

A blob-type trademark requires a different feature selection from that used for
the line type. The contour-representing function of each piece of a trademark
is used to extract the shape information of a blob-type trademark, and the
Fourier transformation technique is applied to this function. The ratio of the
corresponding Fourier coefficients of the unknown trademark and the model object
is computed, and the deviation of this ratio is chosen as the difference
measurement.

4.3.4  The  Symbol  Matching  Process  by  Transformation  to  the  Graph  Model 

Symbol matching establishes a one-to-one correspondence between instances of
the same symbol seen at different times t1 and t2. Template matching as a
recognition scheme for symbols becomes inefficient, owing to the properties of
image data, as the number of symbols grows. Storing a graphics image in the
computer as raw data requires an extremely large memory space; it is thus
necessary to transform the raw data into a compressed computerized form. A
simple one-to-one correspondence between the feature sets for geometrical and
topological features is usually not enough to recognize the symbols. The
relationships between the features can increase the rate of recognition.
However, extracting relationships exhaustively for all the objects would be
prohibitively expensive in documents with many graphics images. A minimum search
tree can avoid excessive effort in extracting additional features. In other
words, a tree-structured description greatly reduces the processing time.

The planar graph is considered an appropriate way to computerize the graphics
image using a matrix representation. The skeleton of a line-type symbol can be
represented by a planar graph model whose nodes represent interesting points in
the skeleton. An edge of the graph model simply represents a line component of
the skeleton in the graphics image. This graph model loses some information of
the original graphics image, since it represents only the relationships among
the line components of the skeletonized symbol. While the graph model is
convenient for showing these relationships, a matrix is a convenient and useful
way of representing a graph in a computer. The recognition of a symbol
represented by the graph model is then converted to the problem of isomorphism
between the matrices of symbols. Two graphs are said to be isomorphic if there
is a one-to-one correspondence between their vertex sets which preserves the
adjacency of vertices. The representation of skeletonized symbols as graph
models can be used to find the most similar symbols. However, a brute-force
matching process that searches for graphs isomorphic to the symbol in question
must examine the permutations of the vertices of the existing graph models.
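
To make the cost of brute-force matching concrete, the following Python sketch
builds the adjacency matrix of a small skeleton graph and tests isomorphism by
trying every vertex permutation. The helper names and the example graphs are
hypothetical. The search visits up to n! permutations, which is why it is only
practical for very small symbols.

```python
from itertools import permutations


def adjacency_matrix(n, edges):
    """Adjacency matrix of an undirected graph with vertices 0..n-1."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


def find_isomorphism(a, b):
    """Return a vertex mapping perm with a[i][j] == b[perm[i]][perm[j]],
    or None when the graphs are not isomorphic."""
    n = len(a)
    if n != len(b):
        return None
    for perm in permutations(range(n)):
        if all(a[i][j] == b[perm[i]][perm[j]]
               for i in range(n) for j in range(n)):
            return perm
    return None


# A loop with a tail, stored twice with different vertex numbering.
symbol = adjacency_matrix(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
model = adjacency_matrix(4, [(3, 2), (2, 1), (1, 3), (1, 0)])
# An open path, which has no loop and cannot match.
path = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])

mapping = find_isomorphism(symbol, model)
no_match = find_isomorphism(symbol, path)
```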

The weighted graph matching problem includes the graph isomorphism problem
[Ume88]. A weighted graph G is an ordered pair (V, w), where V is a set of
vertices of the graph and w is a weighting function which gives a real
nonnegative value $w(v_i, v_j)$ to each pair of vertices $(v_i, v_j)$,
$v_i \in V$, $v_j \in V$, and $v_i \neq v_j$. The adjacency matrix of a weighted
graph G = (V, w) with n vertices is an n x n matrix defined as follows:

$$A_G = [a_{ij}], \qquad a_{ij} = \begin{cases} w(v_i, v_j), & i \neq j \\ 0, & i = j \end{cases} \tag{4.2}$$


The n x n matrix $A_G$ is symmetric when G is an undirected graph. Graph
matching is the problem of finding a one-to-one correspondence $\Phi$ between
$V_1 = \{v_1, v_2, \ldots, v_n\}$ and $V_2 = \{v'_1, v'_2, \ldots, v'_n\}$ which
minimizes a difference between G and H, where $G = (V_1, w_1)$ and
$H = (V_2, w_2)$ are weighted graphs with n vertices each. The criterion for the
measure of difference is defined as follows:

$$J(\Phi) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( w_1(v_i, v_j) - w_2(\Phi(v_i), \Phi(v_j)) \right)^2 \tag{4.3}$$

$J(\Phi)$ can be reformulated as follows by using a permutation matrix P, where
$A_G$ and $A_H$ are the adjacency matrices of the weighted graphs G and H,
respectively:

$$J(P) = \left\| P A_G P^T - A_H \right\|^2 \tag{4.4}$$

where the permutation matrix P represents the vertex correspondence $\Phi$ and
$\| \cdot \|$ is the Euclidean norm.

The weighted graphs G and H are called isomorphic if there exists a one-to-one
correspondence $\Phi$ which makes $J(\Phi)$ equal to zero. That is, from (4.4),

$$P A_G P^T = A_H \tag{4.5}$$

Now, the graph matching problem is to find the permutation matrix P which
satisfies (4.5). However, in real problems such a permutation matrix usually
does not exist. The optimum matching between G and H is then a permutation
matrix P which minimizes J(P) in (4.4). It is difficult to find this permutation
matrix directly. If we extend the domain of J to the set of orthogonal matrices,
the optimum matching between G and H is to find an orthogonal matrix Q which
minimizes J(Q). This extension of the domain is natural because a permutation
matrix is a kind of orthogonal matrix. The set of these orthogonal matrices is
given by

$$Q = U_H S U_G^T, \qquad S \in \mathcal{S}_1 \tag{4.6}$$

assuming eigendecompositions of $A_G$ and $A_H$ as

$$A_G = U_G \Lambda_G U_G^T \tag{4.7}$$

$$A_H = U_H \Lambda_H U_H^T \tag{4.8}$$

In (4.6), $\mathcal{S}_1$ represents
$\{ \mathrm{diag}(s_1, s_2, \ldots, s_n) \mid s_i = 1 \text{ or } -1 \}$.

Now, assuming that G and H are isomorphic, the following formula (4.9) is
obtained from (4.5), (4.7), and (4.8):

$$P U_G \Lambda_G U_G^T P^T = U_H \Lambda_H U_H^T \tag{4.9}$$

Thus,

$$P U_G = U_H S, \qquad S \in \mathcal{S}_1 \tag{4.10}$$

since the eigenvectors of a matrix are uniquely determined, except for their
positive and negative directions, when all eigenvalues are distinct. Then,

$$P = U_H S U_G^T \tag{4.11}$$

This means that there exists some $S \in \mathcal{S}_1$ which makes Q exactly a
permutation matrix when G and H are isomorphic.

However, it is not easy to find the diagonal matrix S. Let this diagonal matrix
be written as S', so that $P' = U_H S' U_G^T$. Let $U_H = [h_{ij}]$,
$U_G = [g_{ij}]$, $S' = \mathrm{diag}(s_i)$, and let $\pi$ be the permutation
represented by P', so that the $(i, \pi(i))$ elements of P' are 1. Then we have

$$\mathrm{tr}(P'^T U_H S' U_G^T) = \sum_{i=1}^{n} \sum_{k=1}^{n} s_k h_{ik} g_{\pi(i)k} \tag{4.12}$$

Obviously, the following holds:

$$s_k h_{ik} g_{\pi(i)k} \leq |s_k h_{ik} g_{\pi(i)k}| = |h_{ik}| \, |g_{\pi(i)k}| \tag{4.13}$$

Thus,

$$\mathrm{tr}(P'^T U_H S' U_G^T) \leq \sum_{i=1}^{n} \sum_{k=1}^{n} |h_{ik}| \, |g_{\pi(i)k}| = \mathrm{tr}(P'^T \bar{U}_H \bar{U}_G^T) \tag{4.14}$$

where $\bar{U}_H$ and $\bar{U}_G$ denote the matrices whose elements are the
absolute values of the corresponding elements of $U_H$ and $U_G$. Since the
length of each row vector of $\bar{U}_H$ and $\bar{U}_G$ is equal to 1 and the
values of its elements are nonnegative, each element of
$\bar{U}_H \bar{U}_G^T$ is greater than or equal to 0 and less than or equal to
1. Thus, for any permutation matrix P, we have

$$\mathrm{tr}(P^T \bar{U}_H \bar{U}_G^T) \leq n \tag{4.15}$$

On the other hand,

$$\mathrm{tr}(P'^T U_H S' U_G^T) = \mathrm{tr}(P'^T P') = n \tag{4.16}$$

This means that P' maximizes $\mathrm{tr}(P^T \bar{U}_H \bar{U}_G^T)$, since
$\mathrm{tr}(P^T \bar{U}_H \bar{U}_G^T) \leq n$ for any permutation matrix P.
Therefore, when G and H are isomorphic, the optimum permutation matrix can be
obtained as a permutation matrix P which maximizes
$\mathrm{tr}(P^T \bar{U}_H \bar{U}_G^T)$.

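For example, the adjacency matrices $A_G$ and $A_H$ of the planar graphs G and
H shown in Figure 4-10 are given in (4.17) and (4.18).

$$A_G = \begin{bmatrix} 0 & 5 & 8 & 6 \\ 5 & 0 & 5 & 1 \\ 8 & 5 & 0 & 2 \\ 6 & 1 & 2 & 0 \end{bmatrix} \tag{4.17}$$

$$A_H = \begin{bmatrix} 0 & 1 & 8 & 4 \\ 1 & 0 & 5 & 2 \\ 8 & 5 & 0 & 5 \\ 4 & 2 & 5 & 0 \end{bmatrix} \tag{4.18}$$

Figure 4-10. An example of the weighted planar graph matching.

The characteristic polynomial of $A_G$ is
$\det(A_G - \lambda I) = (\lambda - 14.2488)(\lambda + 0.28)(\lambda + 4.8247)(\lambda + 9.1441)$.
Eigendecompositions of $A_G$ and $A_H$ are given as follows:

$$A_G = U_G \Lambda_G U_G^T \tag{4.19}$$

$$\Lambda_G = \begin{bmatrix} 14.2488 & 0 & 0 & 0 \\ 0 & -0.28 & 0 & 0 \\ 0 & 0 & -4.8247 & 0 \\ 0 & 0 & 0 & -9.1441 \end{bmatrix} \tag{4.20}$$

$$U_G = \begin{bmatrix} 0.6142 & 0.1409 & -0.1822 & -0.7548 \\ 0.4336 & -0.5276 & 0.7262 & 0.0791 \\ 0.5484 & -0.2700 & -0.5820 & 0.5363 \\ 0.3660 & 0.7930 & 0.3173 & 0.3693 \end{bmatrix} \tag{4.21}$$

$$A_H = U_H \Lambda_H U_H^T \tag{4.22}$$

$$\Lambda_H = \begin{bmatrix} 13.2567 & 0 & 0 & 0 \\ 0 & -0.7744 & 0 & 0 \\ 0 & 0 & -3.4341 & 0 \\ 0 & 0 & 0 & -9.0481 \end{bmatrix} \tag{4.23}$$

$$U_H = \begin{bmatrix} 0.5383 & -0.4358 & -0.4247 & -0.5383 \\ 0.3439 & 0.8801 & -0.0185 & -0.3269 \\ 0.6242 & 0.0255 & -0.2503 & 0.7396 \\ 0.4498 & -0.1867 & 0.8699 & -0.0787 \end{bmatrix} \tag{4.24}$$

From these, the permutation matrix P corresponding to (4.11) is obtained as
follows. Because the eigenvalues of $A_G$ and $A_H$ differ, G and H are not
exactly isomorphic, so this P gives an optimum matching in the sense of (4.4)
rather than an exact isomorphism.

$$P = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \tag{4.25}$$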
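
The result in (4.25) can be checked against the criterion (4.4) by exhaustive
search, which is feasible for four vertices. The Python sketch below is only a
check of the example, not the eigendecomposition method itself; the function
names are hypothetical, and a permutation is stored as a tuple giving, for each
row of P, the zero-based column that holds the 1.

```python
from itertools import permutations

A_G = [[0, 5, 8, 6],
       [5, 0, 5, 1],
       [8, 5, 0, 2],
       [6, 1, 2, 0]]

A_H = [[0, 1, 8, 4],
       [1, 0, 5, 2],
       [8, 5, 0, 5],
       [4, 2, 5, 0]]


def difference(perm, a_g, a_h):
    """J(P) = || P A_G P^T - A_H ||^2 for the permutation matrix P whose
    row i has its 1 in column perm[i]."""
    n = len(a_g)
    return sum((a_g[perm[i]][perm[j]] - a_h[i][j]) ** 2
               for i in range(n) for j in range(n))


def best_matching(a_g, a_h):
    """Exhaustively search for a permutation that minimizes J(P)."""
    n = len(a_g)
    return min(permutations(range(n)),
               key=lambda p: difference(p, a_g, a_h))


p_example = (2, 3, 0, 1)    # the matrix P of (4.25)
j_example = difference(p_example, A_G, A_H)
j_minimum = difference(best_matching(A_G, A_H), A_G, A_H)
```

For these matrices the P of (4.25) gives J(P) = 8, and no other permutation
gives a smaller value. The residual is not zero because the two graphs are
close but not isomorphic.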
4.4  Summary

This  chapter  has  covered  the  recognition  system  for  document  text  including 
nontext  such  as  graphics  and  line  drawings.  The  text  recognition  system  consists  of 
three  stages:  preprocessing,  feature  selection  followed  by  matching,  and 
postprocessing.  Each  character  read  by  an  optical  scanner  is  separated  by  a word 
segmentation  algorithm  using  eight  connected  components.  The  eight  connected 
components  generated  the  circumscribing  rectangles  for  each  individual  character. 
The  segmentation  of  closely  printed  characters  was  resolved  by  combining 
segmentation  with  classification  by  means  of  an  adaptive  decision  tree.  In  this 
method,  the  supervisory  routine  took  control  of  the  window’s  width  and  location 
recursively.  The  viewing  window  was  narrowed  until  either  the  truncated  array  was 
successfully  recognized  or  the  window  became  so  narrow  that  search  was  given  up. 

As feature selection for character recognition, geometrical properties were
mainly considered because of their insensitivity to variations and their ease
of implementation. To increase recognition accuracy, contextual information was
utilized. Several methods, such as those based on compound decision theory and
on a Markov process, were introduced for a contextual word recognition system.

In the second part of this chapter, we discussed the recognition system for
nontext regions. First of all, nontext regions were classified into two
different types, and a different feature selection method was applied to each
type of image data. Prior to applying a feature extraction algorithm, the image
data type was classified by the removal of line segments. In the understanding
of line-type nontext, the geometrical feature was extracted for a
graphics-recognition and interpretation system. This system mainly consisted of
image processing and a pattern recognition algorithm. For a text string
imbedded in nontext, the text string image was separated from the pure graphics
image through the high-resolution binary input. Line segments were analyzed to
find closed loops that formed graphical primitives of known shapes. Meanwhile,
the boundary information was mainly used for recognition of blob-type images.

The matrix representation of graphics images through graph models reduced the
matching problem to a graph isomorphism problem. The adjacency matrix of a
weighted graph was utilized to find a one-to-one correspondence. The optimum
matching between matrices was considered as a solution to the weighted
undirected graph matching problem.


CHAPTER  5 

DOCUMENT  FILING  AND  RETRIEVAL 
5.1  Introduction 

The recognized and interpreted documents should be handled in electronic
format to ease their further handling. This section will describe only the
framework for doing this. The electronic document is considered as a source of
information, as a learning device, and as a mechanism for communication between
people who are distant in time or place. As workstations grow cheaper, more
powerful, and more widely available through advances in computer technology,
electronic documents will become the normal means of communication. Some sort of
file management is required to enable users to manage such documents. A
computerized office requires an office system with powerful and integrated
facilities. Proper integration of these facilities is an important task
requiring unifying concepts that can be used to tie together diverse physical
capabilities. This chapter will focus on the document preparation,
communication, and management aspects of office systems, referred to
collectively as a document management system, which allows the user to group
documents and to file these groups.

The objects in a document management system are the resources that people
require to prepare, communicate, and manage documents. These include the
documents themselves, document repositories, printing facilities, etc. These
system resources are manipulated by office workers playing various office roles
and using various system facilities. There is also growing interest in office
information systems that handle complex data composed of text, attributes,
graphics images, and picture images. Some of the functions that these systems
may provide are creation and filing of such information, content addressability
of computerized documents, automatic insertion of documents in a paper form, and
computerized document transmission.

5.2  System  Resources 

Various  resources  are  required  for  a document  management  system.  A set  of 
generic  objects  which  represent  the  available  system  resources  includes  documents, 
file  folders,  a terminal,  file  cabinets,  a printer,  etc.  The  objects  are  grouped  by 
functionality  and  category.  Functionality  describes  the  functions  of  objects:  either  they 
bear  data,  provide  a service  such  as  acting  as  repositories  for  data  objects,  or  perform 
specific  functions  on  data  objects.  One  of  the  data  bearing  objects  is  the  document. 
The  document  will  be  further  described  in  the  following  sections. 

The  basic  information  carrying  entity  in  an  office  automation  system  is  the 
document.  The  other  objects  in  the  system  provide  various  facilities  for 
communicating  and  managing  documents.  The  most  important  aspects  of  the 
document  object  in  a document  management  system  are  document  structure  and 
contents.  First,  the  types  of  data  that  can  constitute  the  contents  of  a document  and 
the  structuring  of  documents  will  be  discussed.  Second,  the  types  of  constraints  that
can  be  specified  both  on  the  document  contents  and  on  the  document  structure  will
be  considered.

5.2.1  Document  Contents 

A document  is  defined  as  anything  that  can  be  used  to  communicate 
information.  In  a paper  environment,  anything  that  is  written  on  paper  can  be  a 
document.  Thus,  a document  management  system  must  support  at  least  text  and 
attribute  data  types,  where  attribute  data  types  are  the  traditional  data  types 
supported  in  programming  languages  and  data  base  management  systems. 
Meanwhile,  there  are  other  ways  to  communicate  information,  and  these  can  also  be 
regarded  as  potential  constituents  of  documents.  However,  these  other  ways  will  not 
be  discussed  in  this  section  because  they  are  beyond  the  scope  of  this  dissertation. 
Computerized  documents  are  very  important  for  office  automation.  To  support 
computerized  documents,  hardware  facilities  must  be  available  for  handling  the 
different  types  of  data.  In  addition,  user  level  facilities  such  as  editing  must  be 
provided  for  the  different  data  types.  In  a computerized  document  system, 
capabilities should be provided for presentation of computerized documents. A
document formatter should combine attributes, text, pictures, and graphics
images in a way that is easy to use. The formatter may use existing information
in the system. Thus, information extraction from documents stored in the
computerized document system is needed.

5.2.2  Structure  of  Computerized  Documents

The  logical  components  of  a computerized  document  are  illustrated  in  Figure 
5-1.  Computerized  documents  are  composed  of  one  or  more  of  the  following:  a set 
of  attributes,  a text  part,  and  a set  of  images.  They  may  also  have  an  annotation  part. 
The  document  type  contains  minimal  common  information  in  a large  number  of 
documents.  The  text  part  is  composed  of  text  sections.  Each  text  section  is  composed 
of  paragraphs,  and  each  paragraph  is  made  up  of  words,  and  so  on.  This  structuring 
of  the  text  document  allows  queries  to  restrict  retrieval,  on  the  basis  of  the  proximity 
of  words  within  the  text  document,  as  well  as  to  associate  annotation  with  each  of  the 
text  components.  Attributes  have  an  attribute  name  and  a value.  The  value  may  be 
a repeating  group  of  values.  An  image  is  composed  of  an  image  type,  a vector  form, 
a raster  form,  and  a text  part. 

The  vector  form  represents  the  image  as  a set  of  image  objects  which  are 
represented  as  a set  of  ordered  points  and  a set  of  parameter  values.  Points  are  pairs 
of  values  indicating  the  position  of  a point  within  an  image.  Points  may  be  connected 
to  form  lines,  polygons,  polylines,  etc.  Image  objects  may  be  hierarchically  structured. 
In  other  words,  regions  may  contain  other  regions,  polylines,  or  text. 

The raster form represents the image as an ordered set of pixels in two
dimensions. The raster form of an image may contain overlapping raster objects,
which are sets of adjacent pixels. Each raster object corresponds to a distinct
vector object of the same picture, which is a closed polygon. The object caption
is composed of object caption words. Object caption words are of the type text,
and are composed of words or parts of words.

Figure 5-1. Computerized document structure.

The image text part is composed of image text words, which are in turn composed
of parts of words. The image text part is the text related to a given image. It
is formed by (1) the image caption of the given image, (2) text paragraphs
related to the image, (3) object caption words of objects within the image, and
(4) text annotation. Annotation is a further informal explanation of the
contents of a document, paragraph, word, or image. It may be associated with a
text document, text section, text paragraph, text word, or image.
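
The logical structure described in this section can be summarized as a set of
nested record types. The Python sketch below is one possible rendering under
stated assumptions: all class and field names are hypothetical, the raster form
is shown as a plain two-dimensional list, and the proximity helper only
illustrates the kind of query that the word-level structuring makes possible.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Attribute:
    name: str
    values: List[str]            # a value may be a repeating group of values


@dataclass
class ImageObject:
    kind: str                    # e.g. "line", "polygon", "polyline", "region"
    points: List[Point]          # ordered points of the vector form
    caption_words: List[str] = field(default_factory=list)
    children: List["ImageObject"] = field(default_factory=list)


@dataclass
class Image:
    image_type: str
    vector_form: List[ImageObject] = field(default_factory=list)
    raster_form: Optional[List[List[int]]] = None
    text_words: List[str] = field(default_factory=list)


@dataclass
class TextSection:
    paragraphs: List[List[str]]  # each paragraph is a list of words


@dataclass
class Document:
    doc_type: str
    attributes: List[Attribute] = field(default_factory=list)
    text: List[TextSection] = field(default_factory=list)
    images: List[Image] = field(default_factory=list)
    annotation: Optional[str] = None


def words_near(paragraph, first, second, max_gap):
    """True when `first` and `second` occur within `max_gap` words."""
    pos_a = [i for i, w in enumerate(paragraph) if w == first]
    pos_b = [i for i, w in enumerate(paragraph) if w == second]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


doc = Document(
    doc_type="report",
    attributes=[Attribute("keywords", ["raster", "vector"])],
    text=[TextSection([["the", "raster", "form", "of", "an", "image"]])],
)
paragraph = doc.text[0].paragraphs[0]
close_enough = words_near(paragraph, "raster", "image", max_gap=4)
too_far = words_near(paragraph, "raster", "image", max_gap=3)
```

Hierarchical image objects are modeled through the `children` field, so that a
region may contain other regions or polylines.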

5.2.3  Internal  and  External  Representation  of  Documents 

A document  management  system  should  support  the  categorization  of 
documents  according  to  their  type  in  order  to  facilitate  the  management  of 
documents  efficiently.  This  implies  that  all  documents  with  the  same  content 
structure  belong  to  the  same  document  type.  Not  only  does  this  make  management 
of  documents  easier,  but  it  also  facilitates  the  incorporation  of  more  advanced  office 
automation  functions.  The  representation  of  document  types  and  instances  can  be 
divided  into  two  levels:  the  external  representation  and  the  internal  representation. 
The  external  representation  may  be  different  from  the  internal  representation  of  the 
document  in  order  to  allow  for  better  secondary  storage  and  communication 
bandwidth utilization. In other words, the external representation is concerned with what users see, how they see it, and how they use what they see. Meanwhile, the internal representation captures all the information of the external representation in an internal data structure. This data structure is transparent to the user and stores the documents for future use. A sophisticated internal data structure for a document is required to facilitate retrieval. The internal representation of an image need not contain both an object form and a raster form; it may have only one of the two.

A photograph  where  objects  have  been  identified  and  stored  in  the  object 
form  is  an  instance  of  an  image  in  which  both  forms  exist  in  the  internal 
representation.  An  example  of  an  image  having  only  a raster  internal  representation 
is  an  uninterpreted  photograph.  An  image  with  only  an  object  form  as  internal 
representation  can  be  an  engineering  design.  However,  the  object  form  at  the 
external  representation  level  may  be  used  to  display  the  design  in  a raster  display. 
The  internal  representation  of  the  object  form  of  an  image  is  a collection  of  objects. 
With  each  object  is  stored  information  related  to  its  type  such  as  a polygon  or  a 
circle,  its  name,  shading  information,  the  coordinates  of  a set  of  points,  and  name 
display  specifications  such  as  font,  size,  and  position  of  display. 
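
As an illustration, the object form described above could be held in a structure like the following. This is a minimal Python sketch under stated assumptions, not the data structure used in this work; the class and field names (NameDisplay, ImageObject, ObjectFormImage) are hypothetical and were chosen only to mirror the items listed in the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NameDisplay:
    """How the object's name is shown (hypothetical field names)."""
    font: str
    size: int
    position: Tuple[int, int]


@dataclass
class ImageObject:
    """One object of the object form of an image."""
    kind: str                       # type of object, e.g. "polygon" or "circle"
    name: str
    shading: str
    points: List[Tuple[int, int]]   # coordinates of a set of points
    name_display: NameDisplay


@dataclass
class ObjectFormImage:
    """The object form of an image: a collection of objects."""
    objects: List[ImageObject] = field(default_factory=list)

    def find_by_kind(self, kind: str) -> List[ImageObject]:
        # Return every stored object of the requested type.
        return [obj for obj in self.objects if obj.kind == kind]


image = ObjectFormImage()
image.objects.append(
    ImageObject("circle", "wheel", "solid", [(10, 10)],
                NameDisplay("Times", 10, (12, 25)))
)
```

Keeping the name display specifications in a separate record makes it possible to change how a name is drawn without touching the geometry of the object.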

The  internal  representation  of  statistical  type  images  such  as  graphs, 
histograms,  and  tables  is  a collection  of  tables.  The  information  about  the  objects 
composing  the  presentation  of  these  images  in  a specific  device  is  also  maintained. 
The  duplication  of  statistical  type  images  is  not  very  large,  and  the  approach 
facilitates  both  answering  queries  on  the  image  contents  and  presenting  the  image 
in a different form, or in the same form but with different parameters. In addition, it can be used to display the contents of the image in devices which do not have
graphics  or  bit-map  display  capability. 

The  external  representation  of  a computerized  document  in  an  output  device 
will  be  called  a physical  document.  Some  default  information  is  used  for  displaying 
the  document  in  an  output  device.  Here,  default  information  refers  to  font,  size,  line 
spacing,  etc.  Figure  5-2  shows  the  structure  of  a physical  document.  The  physical 
document  is  divided  into  physical  pages.  Each  physical  page  is  composed  of 
rectangles.  A rectangle  can  be  a text  rectangle  or  an  image  rectangle.  Rectangles  are 
identified  by  their  location  within  a physical  page  and  their  size.  Image  rectangles  are 
isomorphic  to  images  of  a computerized  document,  and  text  rectangles  contain 
information  that  is  used  for  displaying  documents  in  an  output  device. 

A descriptor  is  associated  with  each  created  computerized  document.  The 
descriptor  indicates  the  part  of  the  document,  the  internal  form  for  each  part,  and 
its  mapping  to  a physical  document.  Compressed  information  may  also  be  encoded 
in  the  document  descriptor.  The  compression  method  to  be  used  in  such  an 
environment  depends  also  on  the  system  workload  and  the  devices  used.  In  addition, 
since  there  may  be  a variety  of  techniques  that  can  be  used,  the  particular  method 
used  and  its  parameters  may  be  encoded  within  the  descriptor.  This  may  be  more 
important  for  the  image  part  than  for  the  text  or  attribute  parts  of  computerized 
documents,  due  to  the  large  number  of  bits  in  images.  The  external  representation 
of a document type is defined by one or more document templates. A document template specifies (1) the background information for the document template, (2) the layout of the document fields on the document, and (3) the content of the document fields. The document templates and document contents in any appropriate data structure can be represented as a relation in a relational database management system.

Figure 5-2. Physical document structure.

5.2.4  Information  Extraction  in  Documents 

To  achieve  better  storage  utilization  and  to  enhance  content  addressability, 
information  extraction  from  the  document  is  required.  In  executing  the  information 
extraction,  recognition  of  text  is  the  primary  concern  in  document  management. 
Automatic  recognition  of  text  has  been  successful  for  a variety  of  fonts.  This  success 
can  be  applied  to  documents  to  extract  the  document  parts  and  the  information  that 
is  necessary  for  reconstructing  them.  Such  information  will  be  stored  in  the  document 
descriptor.  For  a large  repository  of  information,  pattern  recognition  takes  place  once 
per  document  and  not  for  every  query.  In  other  words,  information  is  extracted  from 
the  bit-maps  at  document  insertion  time,  using  an  information  extraction  subsystem, 
and is stored with the document. Region expansion techniques are applied to the bit-map in order to extract information about the dominant regions of the bit-map. The region expansion technique picks a threshold that divides the image pixels into either objects or background. Some well-known ways to pick the threshold were discussed earlier in the first part of Chapter 2. Assuming the gray-level histogram of the image is bimodal, the threshold is placed at the minimum (valley) that separates the two peaks. However, when the histogram is not a smooth function, it can be difficult to find the right valley between the peaks of the histogram. An elegant method for treating bimodal images assumes that the histogram is the sum of two composite normal functions and determines the valley from the normal parameters. The single-threshold method is useful in simple situations, but it breaks down when the pixels that should form one region vary smoothly in gray level by more than the threshold. To overcome this difficulty, Chow and Kaneko [Cho73] modified the threshold approach either by deemphasizing the low-frequency background variation before applying the original technique or by using a spatially varying threshold method. Their technique divides the image into rectangular subimages and computes a threshold for each subimage. A subimage can fail to have a threshold if its gray-level histogram is not bimodal. Such subimages receive interpolated thresholds from neighboring subimages that are bimodal, and finally the entire picture is thresholded by using separate thresholds for each subimage.
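
The two steps just described, finding the valley of a bimodal histogram and filling in thresholds for subimages that are not bimodal, can be sketched as follows. This is a simplified illustration and not the procedure of [Cho73]: the peak test is a plain local-maximum search, and the interpolation is replaced by the mean of the available 4-neighbors. The function names valley_threshold and fill_missing are hypothetical.

```python
def valley_threshold(hist, smooth=1):
    """Return the gray level at the valley between the two highest peaks,
    or None when the (smoothed) histogram is not bimodal."""
    n = len(hist)
    # Moving-average smoothing; smooth=0 leaves the histogram unchanged.
    sm = []
    for i in range(n):
        lo, hi = max(0, i - smooth), min(n, i + smooth + 1)
        sm.append(sum(hist[lo:hi]) / (hi - lo))
    # Local maxima of the smoothed histogram.
    peaks = [i for i in range(n)
             if sm[i] > 0
             and (i == 0 or sm[i] > sm[i - 1])
             and (i == n - 1 or sm[i] >= sm[i + 1])]
    if len(peaks) < 2:
        return None
    # Keep the two highest peaks and take the minimum between them.
    first, second = sorted(sorted(peaks, key=lambda i: sm[i], reverse=True)[:2])
    return min(range(first, second + 1), key=lambda i: sm[i])


def fill_missing(grid):
    """Give each subimage without a threshold (None) the mean threshold of
    its 4-neighbors that have one. A stand-in for true interpolation."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None:
                near = [grid[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < rows and 0 <= cc < cols
                        and grid[rr][cc] is not None]
                out[r][c] = sum(near) / len(near) if near else None
    return out
```

In use, valley_threshold would be applied to the histogram of every subimage to build the grid, fill_missing would complete the grid, and each pixel would then be compared with the threshold of the subimage that contains it.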

Split  and  merge  techniques  can  then  be  applied  to  decide  the  final  set  of 
regions.  The  technique  is  most  successful  in  defining  dominant  regions  [Hor74]. 
Further  segmentation  of  the  picture  will  probably  require  knowledge  of  the  content 
of the picture and cannot be done easily with general-purpose techniques. Special types of regions that are often encountered in the office environment need to be recognized in a picture. Such regions are surrounded by a boundary and are
categorized  into  square  regions,  parallelogram  regions,  circle  regions,  and  ellipsoidal 
regions,  depending  upon  the  shape  of  the  surrounding  boundary.  There  are  two 
reasons  for  recognizing  these  special  types  of  regions.  First,  it  can  reduce  data  size. 
For instance, a circle requires storage of its center and radius. Second, it can increase the speed of content addressability since not all regions have to be examined to see
if  they  satisfy  special  properties.  User-defined  regions  are  stored  in  an  image 
dictionary  with  a special  code  name  and  anchor  points  in  the  dictionary.  These  user- 
defined  regions  from  the  graphics  editor  can  be  used  to  place  a copy  of  the  region 
in question within an image using the anchor points. The user can insert new user-defined regions into the dictionary at any point in time. The search of the dictionary
can  be  done  by  text  words  attached  to  the  definition  of  the  region.  The  search  for  the 
images  that  contain  a user-defined  region  can  be  done  by  the  code  name  of  the 
region  for  images  created  within  the  system.  Information  describing  the  region  is  also 
extracted  and  stored  with  the  definition  of  the  region. 
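
A dictionary of user-defined regions with the properties listed above (a code name, anchor points, attached text words, and descriptive information) might be organized as in the sketch below. The class RegionDictionary and its method names are hypothetical; the sketch only illustrates the two kinds of search mentioned in the text and is not the implementation of the system.

```python
class RegionDictionary:
    """Illustrative image dictionary for user-defined regions."""

    def __init__(self):
        self._entries = {}

    def insert(self, code_name, anchor_points, words, parameters):
        # New regions can be inserted at any time under a unique code name.
        self._entries[code_name] = {
            "anchor_points": list(anchor_points),
            "words": {w.lower() for w in words},
            "parameters": dict(parameters),
        }

    def search_by_word(self, word):
        # Search the dictionary by a text word attached to the definition.
        word = word.lower()
        return sorted(name for name, entry in self._entries.items()
                      if word in entry["words"])

    def anchor_points(self, code_name):
        # Anchor points are used to place a copy of the region in an image.
        return self._entries[code_name]["anchor_points"]

    def parameters(self, code_name):
        return self._entries[code_name]["parameters"]


regions = RegionDictionary()
regions.insert("R001", [(0, 0), (10, 0)], ["company", "logo"],
               {"shape": "circle", "center": (5, 5), "radius": 5})
regions.insert("R002", [(0, 0)], ["signature"],
               {"shape": "square", "side": 4})
```

Note that the circle entry stores only a center and a radius, which reflects the data reduction argued for above.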

Region  parameters  are  associated  with  each  region,  and  their  parameter 
values  are  extracted  after  the  segmentation  of  the  bit-map  into  regions.  The  user  can 
specify  certain  images,  using  the  defined-image  dictionary,  or  extract  a region  from 
an image that he has seen while browsing through images of the system, or draw the image that he wants on his screen. The system will extract parameters describing the
specified  image.  The  system  will  try  to  match  the  parameters  of  the  defined  region 
with  the  parameters  of  the  images  in  the  system. 

Polylines,  which  are  collections  of  connected  line  segments,  and  image  text 
must  be  handled.  User-defined  polylines  are  named  polylines  and  are  stored  in  the 
defined-image  dictionary  for  reasons  of  compression  and  content  addressability,  as 
was the case with user-defined regions. Such polylines may represent, for instance, resistors, capacitors, or more complicated circuits. A polyline descriptor abstracts the global characteristics of the polyline and allows retrieval based on the similarity of two polylines. Several regions, polylines, and text may be hierarchically structured. In other words, a region may contain other regions or polylines or text.

5.3  System  Facilities 

The  document  management  system  requires  facilities  for  manipulating  the 
system  resources.  Two  different  notions  of  environment  are  needed  in  a document 
management  system:  a general  environment  which  provides  common  or  frequently 
used  facilities  that  are  accessible  to  all  users,  and  application-specific  environments 
which  provide  facilities  that  are  restricted  to  more  knowledgeable  users  or  are  of  less 
global interest. Of the two different environments, the facilities of the general environment will be described. The document management system should include some basic facilities to support the users' tasks. These include editing, formatting, filing, and retrieving. The general environment should also provide a way to access the application-specific environments. Like all the environments, the general environment
must  be  concerned  with  providing  a uniform  and  consistent  interface  to  all  the 
facilities  available  within  the  environment.  Providing  a generic  set  of  operations 
across  all  the  objects  is  considered  to  be  one  of  the  ways  to  achieve  integration  of 
facilities  within  an  environment.  Integration  of  facilities  is  achieved  by  commonality 
of  effect,  as  far  as  the  user  is  concerned. 

A document  is  an  object  composed  of  more  primitive  objects.  Each  object  is 
an instance of a class that defines the possible constituents and representations of the instances. Some document classes are business letters, papers for a particular journal or conference, theses, and programs in a given language. Objects are further classified as either abstract or concrete. An abstract object is denoted by an identifier and the class to which the object belongs. One or more concrete objects correspond to each abstract object. Concrete objects are defined on two-dimensional pages and
represent  the  possible  formatted  images  of  abstract  objects.  For  example,  a particular 
paragraph  of  a document  or  an  abstract  paragraph  object  may  be  represented 
concretely  in  many  different  ways  depending  on  font,  hyphenation  conventions,  line 
length,  and  other  concrete  variables. 

Document  processing  consists  of  executing  various  operations  to  define  and 
manipulate abstract and concrete objects. Objects fall into two distinguishable kinds: ordered and unordered. Many textual objects, such as paragraphs and words,
are  normally  ordered,  implying  that  we  can  speak  of  the  first  one,  the  last  one,  the 
next  one,  the  preceding  one,  and  so  on.  On  the  other  hand,  there  are  many  objects 
that  are  more  naturally  treated  as  unordered  for  particular  applications  such  as 
figures,  tables,  parts  of  mathematical  equations,  and  pieces  of  unrelated  text. 

5.3.1  Editing 

Editing is a requisite facility for the document management system. Editing operations are defined as mappings from either abstract to abstract objects or concrete to concrete objects. Conventional text editing operations map logical text objects to logical text objects. For example, a text insertion or deletion may be a mapping from strings to strings or from paragraphs to paragraphs. An editing facility usually requires tools for each data type, because a document may contain several data types. Not only does the editor need word processing for
text  editing,  but  also  a geometric  editor  for  structured  graphics  design  and  a 
paint/bit-map  editor  for  free-hand  drawing  of  digitized  images.  A single  set  of 
operations  usable  for  the  editing  of  all  data  types  has  been  studied  to  reduce  the 
amount  of  detail  characteristic  of  existing  multi-packaged  document  preparation 
systems  [Fur82]. 

A fully  integrated  editing  facility  is  difficult  to  achieve  without  a uniform 
framework for handling different data types. To address this shortcoming, the boxes-and-glue approach is proposed. This approach uses two-dimensional objects, called boxes, that encase concrete entities such as characters, words, lines, paragraphs, and pages. Boxes are variable in size, and their reference points are used to align them.
The  content  of  a document  can  be  constructed  from  a collection  of  boxes  whose 
contents  may  contain  only  one  type  of  data.  To  insert  information  into  a box,  the 
appropriate  box  is  selected,  positioned  and  sized.  The  type  of  box  defines  the 
appropriate  editor  to  be  invoked. 
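
The rule that the type of a box determines the editor to be invoked amounts to a simple dispatch on the box type. The following Python sketch illustrates this under stated assumptions: the Box class, the three type names, and the EDITORS table are hypothetical, and the editors are represented only by their names.

```python
from dataclasses import dataclass

# Hypothetical mapping from box type to the editor that handles it.
EDITORS = {
    "text": "word processor",
    "graphics": "geometric editor",
    "bitmap": "paint/bit-map editor",
}


@dataclass
class Box:
    """A two-dimensional box holding exactly one type of data."""
    kind: str
    x: int          # position of the reference point
    y: int
    width: int
    height: int
    content: object = None

    def resize(self, width, height):
        # Boxes are variable in size.
        self.width, self.height = width, height


def editor_for(box):
    """Select the editor from the type of the box."""
    try:
        return EDITORS[box.kind]
    except KeyError:
        raise ValueError("no editor registered for box type %r" % box.kind)


figure_box = Box("graphics", x=0, y=0, width=100, height=50)
```

Because each box holds a single data type, adding support for a new data type requires only a new entry in the table and does not affect the existing editors.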

5.3.2  Formatting 

Since  we  are  able  to  categorize  documents  as  to  type,  it  seems  natural  to 
associate  some  formatting  information  with  each  document  type.  Mappings  from 
abstract objects to concrete objects are defined as formatting operations. Transforming a logical character to its representation in a particular font, producing a two-dimensional word with possible hyphenation from a logical word, mapping a paragraph into a sequence of lines, and breaking an abstract document into pages are standard examples. In the nontextual domain, examples include mappings that transform an abstract directed graph to a line drawing and functions that construct or lay out a table from a list of its entries.

This  formatting  information  is  called  the  document  profile,  and  it  specifies  the 
default  appearance  for  the  document  fields  and  background  information.  In  addition, 
we may want to change the appearance of specific fields of a document type. In this case, we need to be able to override the default format, and to associate a different format with part or all of a document field.

Most  interactive  formatters  have  a hierarchical  structure  and  inheritance 
scheme for the format environment [Fur82]. For example, an extended abstract, which can be seen as a paper, has the logical objects defined and structured as follows:

<Extended Abstract> = ( <Header>, <Body>, <References> )

<Header> = ( <Title>, <Authors> <Affiliation> )

<Body> = <Introduction> <Section 1> <Section 2> <Section 3>

<References> = ...

<Title> = "Knowledge based ..."

<Extended Abstract> is an instance of the class of extended abstracts specified for a particular conference. The notation (A, B, ..., H) denotes the unordered set of objects A, B, ..., H; and A B ... H denotes the object sequence A followed by B followed by ... followed by H. Thus the <Header> consists of the object <Title> and the object sequence <Authors> <Affiliation>.

The  format  environment  at  any  point  in  a document  instance  is  the  complete 
set  of  values  that  are  in  force  at  that  point.  The  root  format  environment  of  this 
hierarchy  is  the  document  profile.  In  a particular  format  environment,  the  value  for 
a format  attribute  may  be  undefined.  In  this  case,  the  format  attribute  inherits  its 
value from a higher format environment. In other words, the search for a value may extend all the way back to the document's document profile.
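
The inheritance of format attributes can be illustrated by a chain of environments in which an undefined attribute is looked up in the next higher environment, ending at the document profile. The sketch below is a minimal Python illustration; the class name FormatEnvironment and the attribute names and values are hypothetical.

```python
class FormatEnvironment:
    """A format environment whose undefined attributes are inherited."""

    def __init__(self, values=None, parent=None):
        self.values = dict(values or {})
        self.parent = parent    # the next higher environment, or None at the root

    def lookup(self, attribute):
        # Walk up the hierarchy until some environment defines the attribute.
        env = self
        while env is not None:
            if attribute in env.values:
                return env.values[attribute]
            env = env.parent
        raise KeyError(attribute)


# The document profile is the root of the hierarchy.
profile = FormatEnvironment({"font": "Times", "size": 10, "line_spacing": 1.0})
header = FormatEnvironment({"size": 14}, parent=profile)
title = FormatEnvironment({"font": "Helvetica"}, parent=header)
```

Here the title defines only its font, takes its size from the header, and takes its line spacing from the document profile. Overriding a default for one field therefore requires storing only the attributes that differ.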

5.3.3  Retrieving 

Retrieval  of  a document  at  some  later  point  in  time  is  an  indispensable  facility 
in  a document  management  system.  Typically,  two  types  of  retrieval  patterns  are 
observed. In one, the user is not quite sure of what he or she is looking for. In the other, the user may have an idea of what he or she is looking for, but tends to be vague when formulating the request. Retrieval by specifying
document  content  information  instead  of  document  identifier  is  useful  for  content 
addressability.  The  user  will  have  some  idea  of  the  content  of  documents  that  he 
wants  to  see,  and  will  specify  this  information  in  his  query.  The  system  will  try  to 
return  all  relevant  documents  to  him. 

Conditions  on  the  text  part  of  computerized  documents  involve  Boolean 
conditions  of  text  words  or  parts  of  words.  In  some  cases  converting  image 
recognition problems to attribute and text recognition problems provides a powerful alternative. Image content addressability can be achieved by specifying conditions on the image text part and the image statistical part, as well as by similarity conditions on image objects. Similarity conditions are matched against the parameters of the image objects. These parameters have been extracted and stored at document insertion
time.  Thus,  pattern  recognition  does  not  take  place  at  query  time  with  the  possible 
exception  of  the  extraction  of  information  from  a picture  drawn  by  the  user. 
Retrieving  documents  based  on  conditions  on  an  image’s  text  part  is  different  from 
specifying  conditions  on  the  text  of  the  document.  The  former  specifies  a document 
that  has  an  image  related  to  the  condition  specified.  The  latter  specifies  a document 
related  to  the  condition  specified. 
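
The difference between a condition on the text of a document and the same condition on the text part of an image can be made concrete with a small example. In the Python sketch below, Boolean conditions are built from words or parts of words; the helper names (word, both, either, by_document_text, by_image_text) and the sample documents are hypothetical and serve only as an illustration.

```python
def word(part):
    """Condition: some word contains `part` (a whole word or part of a word)."""
    part = part.lower()
    return lambda words: any(part in w.lower() for w in words)


def both(a, b):
    """Boolean AND of two conditions."""
    return lambda words: a(words) and b(words)


def either(a, b):
    """Boolean OR of two conditions."""
    return lambda words: a(words) or b(words)


def by_document_text(documents, condition):
    # Documents whose own text satisfies the condition.
    return [doc_id for doc_id, doc in documents.items()
            if condition(doc["text"])]


def by_image_text(documents, condition):
    # Documents that contain an image whose text part satisfies the condition.
    return [doc_id for doc_id, doc in documents.items()
            if any(condition(image) for image in doc["image_text"])]


documents = {
    "memo-1": {"text": ["quarterly", "sales", "report"],
               "image_text": [["sales", "chart", "1992"]]},
    "memo-2": {"text": ["sales", "meeting", "agenda"],
               "image_text": []},
    "memo-3": {"text": ["circuit", "design"],
               "image_text": [["amplifier", "schematic"]]},
}
query = both(word("sales"), word("chart"))
```

With these sample documents, the query selects memo-1 when applied to image text parts and selects nothing when applied to document text, since no document text contains both words.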

For  an  image  with  a number  of  statistical  objects  where  each  object  has  an 
internal  representation  in  the  form  of  a table,  the  user  can  focus  his  attention  on  only 
one of the statistical objects at a time. Relationships among tables are not permitted, and conditions on tables may be very selective. As a consequence, the
size of the response is limited. The external representation of a document allows
more  than  one  statistical  object  to  appear  in  the  same  image.  The  user  can  query 
directly  on  the  image  objects  which  do  not  contain  image  text  and  are  not  of  a 
statistical  type.  The  user  specifies  his  queries  on  images  with  the  help  of  the  graphics 
editor,  the  special  type  images  dictionary,  the  defined-images  dictionary,  and  the 
texture  dictionary. 

The specification of the query can be done interactively by using the image editor to draw objects and their structural relationships. The user can also specify a texture directly for a textured image. The user may also want to allow flexibility about objects that he draws. He may indicate that rotation, translation, and scaling are allowed. If the user is not very confident about the shape of the object, only general measures are examined for matching.

The  system  tries  to  match  the  user  description  of  the  object  with  the 
descriptions  of  the  stored  objects.  A similarity  measure  is  computed,  and  images  with 
similar objects are returned to the user. The system also indicates to the user which object qualified in a given image, so that the user is able to see a possible error or omission in the specification of his query. If either occurs, he may want to further edit the image of his query or he may want to redefine the image.
Region  expansion  techniques  can  be  used  to  find  the  dominant  objects  of  an  image. 
Structural  relationships  of  objects  are  hierarchical,  so  detection  of  relationships  is 
easy.  In  the  case  of  a more  specialized  environment  for  a particular  application, 
application-related  techniques  are  desirable. 
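
The matching step can be illustrated with a simple similarity measure over stored object parameters. The text does not specify the measure, so the sketch below assumes one: the parameters are numeric, and similarity is defined as 1 / (1 + Euclidean distance) over the parameters that both descriptions share. The parameter names compactness and elongation, like the function names, are hypothetical.

```python
import math


def similarity(query, stored):
    """Assumed measure: 1 / (1 + Euclidean distance) on shared parameters."""
    shared = query.keys() & stored.keys()
    if not shared:
        return 0.0
    distance = math.sqrt(sum((query[k] - stored[k]) ** 2 for k in shared))
    return 1.0 / (1.0 + distance)


def match(query, images, threshold=0.5):
    """Return (image id, object name, similarity) for every stored object
    that is similar enough to the query, best match first."""
    hits = []
    for image_id, objects in images.items():
        for name, parameters in objects.items():
            score = similarity(query, parameters)
            if score >= threshold:
                hits.append((image_id, name, score))
    return sorted(hits, key=lambda hit: -hit[2])


# Parameters extracted and stored at document insertion time (sample values).
images = {
    "img1": {"disk": {"compactness": 0.9, "elongation": 1.0}},
    "img2": {"bar": {"compactness": 0.2, "elongation": 6.0}},
}
hits = match({"compactness": 0.8, "elongation": 1.1}, images)
```

Returning the name of the qualifying object together with the image corresponds to the feedback described above, which lets the user detect an error or omission in the query.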

5.4  Summary 

In  this  chapter,  we  have  discussed  what  type  of  document  structure  would  be 
desirable  for  computerized  documents  and  what  facilities  it  should  provide.  For  these, 
we  have  presented  the  functionality  of  the  document,  one  of  the  system  resources  of 
document  management  systems.  As  the  most  important  aspect  of  the  document 
object  in  a document  management  system,  document  structure  and  its  content  were 
discussed. The representation of document types and instances was divided into two levels, internal and external. To design a computerized document is to transform the
internal  representation  to  the  external  representation.  The  external  representation 
does  not  always  correspond  to  the  internal  representation.  The  internal 
representation  captures  all  the  information  of  the  external  representation  in  an 
internal  data  structure. 

The  document  management  system  requires  two  different  environments  for 
system  facilities.  One  of  these  environments,  called  the  general  environment,  was  the 
main  concern  and  provides  common  or  frequently  used  facilities.  The  basic  facilities, 
such as editing, formatting, and retrieving, were discussed conceptually. Among these, retrieval of a document at some later point in time is indispensable and is considered a main concern in computerizing documents. Retrieval by specifying document content information is more useful than retrieval by document identifier. Retrieval based on text parts, image content, and images with a number of statistical objects was discussed. The system matches the user description of the objects with the descriptions of the stored objects.


CHAPTER  6 
CONCLUSION 

6.1  Discussion 

This  dissertation  addresses  the  problems  concerning  block  segmentation  and 
block  classification  of  a digitized  document.  Systems  that  can  simply  capture,  store, 
print,  and  distribute  images  have  widespread  utility.  A commercially  available 
character  recognition  machine,  which  is  able  to  recognize  textual  information  only, 
can  be  used  as  a subsystem  within  a general  document  analysis  system.  The  more 
advanced  aspects  of  document  analysis,  such  as  automatic  block  segmentation  and 
block  classification,  can  encode  complex  documents  containing  a mixture  of  text, 
graphics  images,  and  pictures.  For  the  block  segmentation  problem,  we  developed  an 
algorithm  which  uses  a global  operator  to  separate  the  blocks  by  the  wide  space 
between them. A page with double spacing is rarely seen in professional documents such as journals, magazines, and books.

The ratio of the distance between text lines to the width of the text block is usually less than 0.02. For instance, if the distance between the text lines is 0.1 cm and the width of the text block is 6 cm, then the ratio is approximately 0.01667. Using the RLSA method, the allowable skew angle for this ratio is only 0.95°. It is unlikely that a document can be scanned with so little skew. The advantages of the method presented here over the previously published ones are summarized as follows:

(1)  The  procedure  does  not  restrict  the  scanning  direction  of  the  document.  That  is, 
it  is  insensitive  to  the  skew  of  the  document  placement  on  the  scanner.  This 
leads  to  much  simplified  processing. 

(2) Its time complexity is O(n^2), and the complexity depends on the global operations that connect each black pixel. This complexity leads to fast processing speed.
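
The skew tolerance quoted above for the RLSA method can be checked numerically for a ratio of about 0.01667, if one assumes that the allowable skew is the angle whose tangent equals the ratio of the line spacing to the block width. This geometric reading is an assumption, since the text states only the resulting value; the function name is hypothetical.

```python
import math


def allowable_skew_degrees(ratio):
    """Angle (in degrees) whose tangent equals the given ratio.

    Assumes the text line may rise by at most one line spacing
    across the full width of the text block.
    """
    return math.degrees(math.atan(ratio))


ratio = 1.0 / 60.0                      # approximately 0.01667
angle = allowable_skew_degrees(ratio)   # approximately 0.95 degrees
```

Under this assumption the computed angle agrees with the stated tolerance, and the upper bound of 0.02 on the ratio corresponds to a tolerance of about 1.15 degrees.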

For  the  block  classification  problem  of  the  document  image,  the  classification 
rule which is invariant to skew was utilized. The ratio of the black-white transition count to the count of black pixels separates graphics images from text blocks or complex line drawings. Line segments were removed to distinguish text blocks from other blocks. The advantages of this method are summarized as follows:

(1) It operates at the pixel level. Therefore, for text blocks the ratio defined above is insensitive to the rotation of the document.

(2)  It  provides  a straightforward  algorithm  to  separate  text  blocks  from  other  blocks 
with  similar  ratios  by  eliminating  pure  line  segments. 
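
The ratio used by the classification rule can be computed directly from a binary block. The sketch below is an illustration under stated assumptions: a block is a list of rows of 0 (white) and 1 (black), and transitions are counted between horizontally and vertically adjacent pixels. The exact counting convention used in this work is not restated here, so the convention in the code is an assumption.

```python
def transition_ratio(block):
    """Ratio of black-white transitions to black pixels in a binary block."""
    black = sum(sum(row) for row in block)
    transitions = 0
    # Transitions between horizontally adjacent pixels.
    for row in block:
        for left, right in zip(row, row[1:]):
            if left != right:
                transitions += 1
    # Transitions between vertically adjacent pixels.
    for upper, lower in zip(block, block[1:]):
        for above, below in zip(upper, lower):
            if above != below:
                transitions += 1
    return transitions / black if black else 0.0


solid = [[1, 1, 1, 1] for _ in range(4)]     # picture-like: large black areas
striped = [[1, 0, 1, 0] for _ in range(4)]   # text-like: thin strokes
```

Thin strokes produce many transitions per black pixel, so the striped block yields a much higher ratio than the solid block, which is the behavior the rule relies on for text.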

For  the  recognition  problem  of  each  block,  strong  use  was  made  of  experts  for 
nontext  region  understanding  for  such  applications  as  logic  circuit  diagrams  and 
mechanical engineering drawings. In identifying trademarks and symbols, a type decision was made in order to apply one of two different recognition methods depending upon the type of trademark or symbol. The conversion from the thinned image to a planar graph model was done to simplify the identification problem. Then the planar graph model was converted to a matrix representation, so that the nontext understanding problem corresponded to the isomorphism problem in matrix matching.
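
The reduction of identification to an isomorphism problem on matrices can be illustrated with a brute-force check on small weighted adjacency matrices. The sketch below is not the weighted graph matching algorithm used in this work; it tries every relabeling of the nodes, which is practical only for very small graphs. The function names are hypothetical.

```python
from itertools import permutations


def adjacency_matrix(n, edges):
    """Build a symmetric weighted adjacency matrix from (u, v, weight) edges."""
    matrix = [[0] * n for _ in range(n)]
    for u, v, weight in edges:
        matrix[u][v] = matrix[v][u] = weight
    return matrix


def isomorphic(a, b):
    """True if some relabeling of the nodes makes matrix a equal to matrix b."""
    n = len(a)
    if n != len(b):
        return False
    for perm in permutations(range(n)):
        if all(a[i][j] == b[perm[i]][perm[j]]
               for i in range(n) for j in range(n)):
            return True
    return False


# A triangle with a tail, and the same graph with its nodes relabeled.
model = adjacency_matrix(4, [(0, 1, 1), (1, 2, 1), (2, 0, 1), (2, 3, 1)])
candidate = adjacency_matrix(4, [(3, 2, 1), (2, 1, 1), (1, 3, 1), (1, 0, 1)])
path = adjacency_matrix(4, [(0, 1, 1), (1, 2, 1), (2, 3, 1)])
star = adjacency_matrix(4, [(0, 1, 1), (0, 2, 1), (0, 3, 1)])
```

The check is independent of how the nodes were numbered when the thinned image was converted to a graph, which is the property needed when the same symbol is scanned in different positions.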

A higher  level  document  understanding,  page-structure  analysis,  was  discussed. 
Document  structure  varies  from  one  type  of  document  to  another.  It  is  not  easy  to 
develop  a general  system  which  is  able  to  analyze  all  types  of  documents 
automatically.  We  have  introduced  one  standard  page  structure.  Page-structure 
analysis  would  be  an  interesting  topic  for  further  study. 

6.2  Contributions 

This  research  provides  a satisfactory  solution  for  automatic  block  segmentation 
and  classification  of  the  digitized  paper-based  documents.  The  key  contributions  of 
this  research  are  summarized  as  follows: 

(1) Development of a fast and robust algorithm for block segmentation. This method utilizes a global operator to connect black picture elements within the same block and is insensitive to skew.

(2) Development of an approach that classifies each segmented block even when the blocks are skewed. In the actual scanning process the scanning line is unlikely to correspond to the text line. This means that the skewed block is the likely starting point for block classification. This method makes use of measurements which are not affected by the block shape.

(3) Development of a systematic and efficient matching process. The thinned image of a line-type graphics image is converted to a planar graph model. This method identifies the line-type graphics image by the weighted graph matching algorithm.

(4) Design of a system for understanding some graphics and symbols. Professional documents increasingly contain graphics images to ease reader understanding. Our process divides the graphics and symbols into either line type or blob type; each type is then identified by geometrical features or boundary features, respectively.


APPENDIX  A 

ROBUSTNESS  OF  THE  NEW  SEGMENTATION  ALGORITHM 


Several  algorithms  for  block  segmentation  have  been  reported.  However, 
many  of  these  algorithms  are  very  restrictive  concerning  the  accurate  placement  of 
documents on the scanner. In one approach that relaxes this restriction, the image is first processed to determine the individual connected components. Individual characters and other large figures are connected at the lowest level of analysis. Characters that lie within a certain distance of each other are merged into words, words are merged into lines, lines are merged into paragraphs, and paragraphs into even larger blocks, if such a merging is possible. However, this bottom-up approach requires long processing time and may not be robust. The new segmentation rule is considered robust; in other words, it can separate the blocks even if the documents are scanned with skew.
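
The idea of separating blocks by connecting black pixels that lie within a predetermined distance can be illustrated with a small sketch. The code below is not the global operator defined in this work; it assumes Euclidean distance between pixel coordinates and groups pixels with a union-find structure, comparing every pair of black pixels. Because Euclidean distance is unchanged by rotation, the grouping is the same for a skewed copy of the page. All names are hypothetical.

```python
import math


def segment_blocks(pixels, distance):
    """Group black pixels so that pixels within `distance` share a block."""
    parent = list(range(len(pixels)))

    def find(i):
        # Find the representative of i, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (x1, y1) in enumerate(pixels):
        for j in range(i + 1, len(pixels)):
            x2, y2 = pixels[j]
            if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= distance ** 2:
                parent[find(i)] = find(j)

    blocks = {}
    for i, pixel in enumerate(pixels):
        blocks.setdefault(find(i), []).append(pixel)
    return list(blocks.values())


def rotate(pixels, degrees):
    """Rotate pixel coordinates about the origin to simulate skew."""
    a = math.radians(degrees)
    return [(x * math.cos(a) - y * math.sin(a),
             x * math.sin(a) + y * math.cos(a)) for x, y in pixels]


page = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]   # two groups, gap of 8
```

With a connection distance of 2, the sample page yields two blocks both before and after a rotation of 10 degrees, since the gap between the groups exceeds the distance in either orientation.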

Definition 3. Let $D_1$ and $D_2$ be document images, where $D_2$ is the document
image obtained by rotating $D_1$ by $\alpha$ degrees. Denote this operation as
$\mathrm{rot}_\alpha(D_1) = D_2$.

Definition 4. Let $\uplus$ be a logical document image OR summation at each
location such that $\uplus_N P(k) = P(k) + P(k) + \ldots + P(k)$, where $+$ denotes
the logical OR operation. The operator $\uplus_N$ implies that the logical document
image OR summation at each location is applied on the page image $N$ times.

Definition 5. Denote $D_2 \models D_1$ if there exists $0 \le \beta \le \alpha$ such
that $D_2 = \mathrm{rot}_\beta(D_1)$.

Theorem: $\mathrm{rot}_\beta(\uplus_N OP_1(P(x,y))) \models \uplus_N OP_1(\mathrm{rot}_\beta(P(x,y)))$
for all $0 \le \beta \le \alpha$.

Proof: From Definition 5, there exists $0 \le \beta \le \alpha$ such that
$\mathrm{rot}_\beta(\uplus_N OP_1(P(x,y))) \models \uplus_N OP_1(\mathrm{rot}_\beta(P(x,y)))$.
From Definition 2, $\uplus_N OP_1(P(x,y)) \supseteq \mathrm{rot}_\beta(P(x,y))$. There
exists $0 \le \beta \le \alpha$ such that
$\mathrm{rot}_\beta(\uplus_N OP_1(P(x,y))) = \uplus_N OP_1(\mathrm{rot}_\beta(P(x,y)))$.
Therefore, from Definition 5,
$\mathrm{rot}_\beta(\uplus_N OP_1(P(x,y))) \models \uplus_N OP_1(\mathrm{rot}_\beta(P(x,y)))$.

By the theorem above, the proposed approach separates each block regardless
of skewing of the document. Assume that $\uplus_N OP_1(P(x,y))$ is well segmented.
Here, "well segmented" means that the document is segmented by the wide white space
which separates the page into blocks. Then $\mathrm{rot}_\beta(\uplus_N OP_1(P(x,y)))$
is also well segmented without loss of generality. By the theorem,
$\mathrm{rot}_\beta(\uplus_N OP_1(P(x,y))) \models \uplus_N OP_1(\mathrm{rot}_\beta(P(x,y)))$.
There exists $0 \le \beta \le \alpha$ such that
$\mathrm{rot}_\beta(\uplus_N OP_1(P(x,y))) = \uplus_N OP_1(\mathrm{rot}_\beta(P(x,y)))$.
Therefore, $\uplus_N OP_1(\mathrm{rot}_\beta(P(x,y)))$ is also well segmented.


APPENDIX  B 

SOME  METHODOLOGIES  OF  CONTEXTUAL  WORD  RECOGNITION

The organization of a contextual word recognition system is shown in Figure
B-1. The input consists of a string of machine-printed characters. Suppose the output
of a character recognizer is a word $X_1, X_2, \ldots, X_n$ of length $n$. Suppose also
that the character recognizer has assigned to each character a set of labels with
given probabilities. For example, the probability that the true label of character
$X_i$ is $\theta_i$ may be given by $P(\theta_i \mid X_i)$. The problem in contextual
word recognition is to determine the combination of labels $\theta_1, \ldots, \theta_n$
that maximizes the a posteriori probability
$P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n)$. Three methods for maximizing this
probability are described below, based on (1) compound decision theory, (2) a Markov
process, and (3) dictionary look-up.


Figure B-1. Organization of a contextual word recognition system

Method  Based  on  Optimum  Compound  Decision  Theory

By Bayes’ rule, the a posteriori probability can be expressed as

$$
P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n) =
\frac{P(X_1 \ldots X_n \mid \theta_1, \ldots, \theta_n)\, P(\theta_1, \ldots, \theta_n)}{P(X_1 \ldots X_n)}
\tag{B.1}
$$

Independence of the shapes of printed characters in a word means that every
character of a sequence is classified on the basis of the information from the
character itself. Under the realistic assumption that the recognizer behavior is
independent of previous decisions,

$$
P(X_1 \ldots X_n \mid \theta_1, \ldots, \theta_n)
= \prod_{i=1}^{n} P(X_i \mid \theta_i)
= \prod_{i=1}^{n} \frac{P(X_i)\, P(\theta_i \mid X_i)}{P(\theta_i)}
\tag{B.2}
$$


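From the two expressions above, the a posteriori probability is rewritten as

$$
P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n) =
\frac{P(\theta_1, \ldots, \theta_n)}{P(X_1 \ldots X_n)}
\prod_{i=1}^{n} \frac{P(X_i)\, P(\theta_i \mid X_i)}{P(\theta_i)}
\tag{B.3}
$$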


Since the word $X_1 \ldots X_n$ is given and the a priori probability $P(\theta_i)$
is known, and since the a priori probability of the word
$P(\theta_1, \ldots, \theta_n)$ is determined from the frequency of the word,
maximizing the a posteriori probability corresponds to maximizing

$$
P(\theta_1, \ldots, \theta_n) \prod_{i=1}^{n} P(\theta_i \mid X_i)
\tag{B.4}
$$
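
As a minimal sketch of how expression (B.4) could be evaluated, the following scores each candidate word by its a priori probability times the product of the per-character recognizer probabilities. The function name `best_word`, the data layout, and all probability values are assumptions made for illustration; they do not come from the system described here.

```python
import math


def best_word(char_posteriors, word_priors):
    """Return the word maximizing P(theta_1..theta_n) * prod_i P(theta_i | X_i).

    char_posteriors: one dict per character position, mapping label -> P(label | X_i).
    word_priors: dict mapping candidate word -> a priori word probability.
    Both structures are hypothetical stand-ins for recognizer output and word
    frequency data.
    """
    best, best_score = None, -math.inf
    for word, prior in word_priors.items():
        if len(word) != len(char_posteriors) or prior <= 0:
            continue
        # Work in the log domain to avoid underflow on long words.
        score = math.log(prior)
        for label, posterior in zip(word, char_posteriors):
            p = posterior.get(label, 0.0)
            if p <= 0:
                score = -math.inf
                break
            score += math.log(p)
        if score > best_score:
            best, best_score = word, score
    return best


# Made-up recognizer output for a three-character word.
char_posteriors = [
    {"c": 0.6, "e": 0.4},
    {"a": 0.5, "o": 0.5},
    {"t": 0.7, "l": 0.3},
]
# Made-up word frequencies.
word_priors = {"cat": 0.02, "cot": 0.01, "eat": 0.04}
print(best_word(char_posteriors, word_priors))  # eat
```

In this example the recognizer prefers "c" for the first character, yet the higher word frequency of "eat" outweighs that preference.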


Method  Based  on  a Markov  Process 

A first order Markov process approximates the a priori probability as

$$
P(\theta_1, \ldots, \theta_n) = P(\theta_0) \prod_{i=1}^{n} P(\theta_i \mid \theta_{i-1})
\tag{B.5}
$$

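where $\theta_0$ is the label of the character to the left of a word, known to be
a space. Therefore, $P(\theta_0) = 1$. Assuming the process is memoryless, we get

$$
P(X_1 \ldots X_n \mid \theta_1, \ldots, \theta_n)
= \prod_{i=1}^{n} P(X_i \mid \theta_i)
= \prod_{i=1}^{n} \frac{P(\theta_i \mid X_i)\, P(X_i)}{P(\theta_i)}
\tag{B.6}
$$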


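If we substitute the two expressions above into the a posteriori probability, the a
posteriori probability becomes

$$
P(\theta_1, \ldots, \theta_n \mid X_1 \ldots X_n) =
\frac{1}{P(X_1 \ldots X_n)}
\prod_{i=1}^{n} \frac{P(\theta_i \mid X_i)\, P(\theta_i \mid \theta_{i-1})\, P(X_i)}{P(\theta_i)}
\tag{B.7}
$$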

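As described in the previous method, $X_1 \ldots X_n$ and the a priori probability
$P(\theta_i)$ are known. To maximize the a posteriori probability, we need to maximize

$$
\prod_{i=1}^{n} P(\theta_i \mid X_i)\, P(\theta_i \mid \theta_{i-1})
\tag{B.8}
$$

or equivalently maximize

$$
\sum_{i=1}^{n} \left[ \log P(\theta_i \mid X_i) + \log P(\theta_i \mid \theta_{i-1}) \right]
\tag{B.9}
$$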

where $P(\theta_i \mid \theta_{i-1})$ represents the probability that, if the
$(i-1)$th character has label $\theta_{i-1}$, the $i$th character has label
$\theta_i$. This is known as the transition probability and is determined from the
underlying language. The maximum of the expression above is found by the Viterbi
algorithm, which efficiently searches for the maximum of a sum.
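
The search for the maximum of expression (B.9) can be sketched as a standard Viterbi recursion. This is a minimal illustration, not the implementation used in this work: the function name `viterbi`, the dictionary-based data layout, and all probability values are assumptions.

```python
import math


def viterbi(char_posteriors, transition, labels, start=" "):
    """Maximize sum_i [log P(theta_i | X_i) + log P(theta_i | theta_{i-1})].

    char_posteriors: one dict per position, label -> P(label | X_i).
    transition: dict keyed by (previous_label, label) -> P(label | previous_label).
    start: theta_0, the space to the left of the word.
    All inputs are hypothetical stand-ins for recognizer and language statistics.
    """
    def log(p):
        return math.log(p) if p > 0 else -math.inf

    # score[l] is the best log-score of any label sequence ending in label l.
    score = {
        l: log(char_posteriors[0].get(l, 0)) + log(transition.get((start, l), 0))
        for l in labels
    }
    back = []  # back[i][l] is the best predecessor of label l at position i + 1
    for post in char_posteriors[1:]:
        new_score, pointers = {}, {}
        for l in labels:
            prev = max(labels, key=lambda k: score[k] + log(transition.get((k, l), 0)))
            new_score[l] = (score[prev] + log(transition.get((prev, l), 0))
                            + log(post.get(l, 0)))
            pointers[l] = prev
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda l: score[l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Made-up example: the recognizer slightly prefers "g" over "q" for the first
# character, but "u" follows "q" far more often than it follows "g".
labels = ["g", "q", "u"]
char_posteriors = [{"g": 0.6, "q": 0.4}, {"u": 1.0}]
transition = {(" ", "g"): 0.5, (" ", "q"): 0.5, ("q", "u"): 0.9, ("g", "u"): 0.1}
print(viterbi(char_posteriors, transition, labels))  # qu
```

The recursion keeps only the best partial sequence per label at each position, so the cost grows linearly with word length rather than exponentially.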

Method  Based  on  Dictionary  Look-Up

The dictionary look-up method makes use of identity comparison, and three
distance functions optimize a contextual postprocessing system [Dos77]. This method
requires enough words in the dictionary to find the right entry for a given word. If
the word does not exist in the dictionary, we have to find the word from the
substitution sets that has the next highest a posteriori probability and check that
word’s validity against the dictionary. This method guards against the failure case
in which the substitution set with the largest a posteriori probability does not
represent the string; in other words, the case in which the word with the highest a
posteriori probability is not the correct one. If the word with the highest a
posteriori probability is an illegal word that does not exist in the language, we
can easily detect it by checking the word against a dictionary.
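
The fallback described above, in which the next most probable candidate is tried when the best one is not in the dictionary, can be sketched as follows. This is a simplified illustration under stated assumptions: it enumerates every label combination exhaustively, it does not implement the distance functions of [Dos77], and the function name `dictionary_lookup` and all values are hypothetical.

```python
import math
from itertools import product


def dictionary_lookup(char_posteriors, dictionary):
    """Return the most probable candidate string that is a dictionary word.

    char_posteriors: one dict per position, label -> P(label | X_i); each dict
    plays the role of a substitution set for that position (an assumption).
    dictionary: a set of legal words.
    """
    candidates = []
    for combo in product(*(p.items() for p in char_posteriors)):
        word = "".join(label for label, _ in combo)
        score = math.prod(prob for _, prob in combo)
        candidates.append((score, word))
    # Try candidates in order of decreasing probability.
    candidates.sort(key=lambda c: c[0], reverse=True)
    for _, word in candidates:
        if word in dictionary:
            return word
    return None  # no candidate is a legal word


# Made-up recognizer output: the most probable string is "cot".
char_posteriors = [
    {"c": 0.6, "e": 0.4},
    {"a": 0.45, "o": 0.55},
    {"t": 0.7, "l": 0.3},
]
print(dictionary_lookup(char_posteriors, {"cat", "eat", "cut"}))  # cat
```

Exhaustive enumeration grows exponentially with word length, so a practical system would need to prune the candidate list; the sketch only shows the ordering and validity check.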